Sound Source Localization is All About Cross-Modal Alignment
Class-Incremental Grouping Network for Continual Audio-Visual Learning
Audio-Visual Class-Incremental Learning
DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-Guided Speaker Embedding
The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion
On the Audio-Visual Synchronization for Lip-to-Speech Synthesis
Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation
Hyperbolic Audio-Visual Zero-Shot Learning
AdVerb: Visually Guided Audio Dereverberation
Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation