
MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

Mingzhen Sun, Weining Wang, Yanyuan Qiao, Jiahui Sun, Zihan Qin, Longteng Guo, Xinxin Zhu, Jing Liu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; University of Adelaide
ACM Multimedia (2024)
MM Speech Benchmark

📝 Paper Summary

Sounding Video Generation (SVG), Multi-modal Generation
MM-LDM unifies audio and video generation into a single latent diffusion framework using a hierarchical autoencoder that compresses signals into low-level perceptual latents and high-level shared semantic features.
Core Problem
Generating aligned audio and video simultaneously is difficult due to high dimensionality, disparate data formats (1D audio vs. 3D video), and distinct semantic content patterns.
Why it matters:
  • Current methods such as MM-Diffusion operate in the high-dimensional signal space, which imposes a heavy computational burden and slows sampling
  • Existing attention-based fusion mechanisms often use small windows (e.g., size 8), limiting cross-modal context and leading to poor synchronization
  • Distinct representations (waveforms vs. pixel grids) make it hard for single models to capture correlations without unified formatting
Concrete Example: Previous work uses random-shift attention with small windows, preventing the model from learning global audio-video alignment. MM-LDM converts audio to images and uses full attention in a compressed latent space, enabling generation of 256x256 videos with synchronized sound.
Key Novelty
Hierarchical Multi-Modal Autoencoder with Audio-as-Image Representation
  • Converts 1D audio into 2D 'audio images' (spectrograms) to unify input formats with video frames, allowing a shared diffusion backbone
  • Establishes two feature spaces: a low-level 'perceptual' space for compression and a high-level 'semantic' space for cross-modal guidance
  • Decoders use the *semantic* features of one modality to guide the reconstruction of the *perceptual* latents of the other, ensuring content alignment
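The audio-as-image idea in the first bullet can be illustrated with a minimal sketch. It assumes a plain log-magnitude short-time Fourier transform; the function name `audio_to_image` and the `frame_len` and `hop` values are hypothetical choices for illustration, and the summary does not state which spectrogram variant or settings MM-LDM uses.

```python
import cmath
import math


def audio_to_image(samples, frame_len=64, hop=32):
    """Turn a 1D waveform into a 2D log-magnitude spectrogram.

    Rows are frequency bins and columns are time frames, so the result
    can be handled like a single-channel image. Illustrative only: the
    frame length, hop size, and log scaling are assumptions.
    """
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        column = []
        for k in range(n_bins):
            # Naive DFT for bin k (slow but dependency-free).
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            column.append(math.log1p(abs(acc)))
        frames.append(column)
    # Transpose so that rows index frequency and columns index time.
    return [list(row) for row in zip(*frames)]


# Usage: a 1 kHz tone sampled at 8 kHz lands in bin 1000 * 64 / 8000 = 8.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(1024)]
image = audio_to_image(tone)
first_column = [row[0] for row in image]
print(len(image), len(image[0]), first_column.index(max(first_column)))
```

Once audio has this 2D layout, it can share convolutional or attention layers with video frames, which is the motivation the bullets give for a shared diffusion backbone.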
Architecture
Figure 2: Overview of the MM-LDM framework, showing the autoencoder compression and the diffusion process.
Evaluation Highlights
  • Lowers FVD (Fréchet Video Distance, lower is better) by 114.6 relative to MM-Diffusion on the AIST++ dataset
  • Reduces computational complexity significantly, achieving 10x faster sampling speed compared to signal-space baselines
  • Lowers the audio quality metric FAD (Fréchet Audio Distance, lower is better) by 2.1 points on AIST++
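Both FVD and FAD are Fréchet distances between Gaussian fits of feature sets extracted from real and generated samples, so a gap of 0 means the two fits coincide. The sketch below is a simplified illustration that assumes diagonal covariances; the standard metrics use full covariance matrices and features from pretrained networks, and the function name `frechet_distance_diag` and the toy feature vectors are hypothetical.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(feats_a, feats_b):
    """Squared Frechet distance between two Gaussian fits.

    Simplifying assumption: covariances are diagonal, so the general
    formula |mu_a - mu_b|^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    reduces to a per-dimension sum.
    """
    total = 0.0
    for d in range(len(feats_a[0])):
        a = [f[d] for f in feats_a]
        b = [f[d] for f in feats_b]
        mu_a, mu_b = fmean(a), fmean(b)
        var_a, var_b = pvariance(a, mu_a), pvariance(b, mu_b)
        total += (mu_a - mu_b) ** 2 + var_a + var_b - 2 * math.sqrt(var_a * var_b)
    return total


# Usage: toy 2D "features"; the second set is the first shifted by 3 along
# the first axis, so only the mean term contributes (3^2 = 9).
real = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
generated = [[x + 3.0, y] for x, y in real]
print(frechet_distance_diag(real, real))       # identical sets
print(frechet_distance_diag(real, generated))  # shifted set
```

Because the distance grows with both mean shift and variance mismatch, a lower score indicates that generated samples match the real feature distribution more closely.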
Breakthrough Assessment
8/10
Delivers significant efficiency gains (10x faster sampling) and a unified architecture for a notoriously difficult multi-modal task. The hierarchical latent design sensibly addresses the semantic gap between the two modalities.