Institute of Automation, Chinese Academy of Sciences,
School of Artificial Intelligence, University of Chinese Academy of Sciences,
University of Adelaide
ACM Multimedia (2024)
MM, Speech, Benchmark
📝 Paper Summary
Sounding Video Generation (SVG), Multi-modal Generation
MM-LDM unifies audio and video generation into a single latent diffusion framework using a hierarchical autoencoder that compresses signals into low-level perceptual latents and high-level shared semantic features.
Core Problem
Generating aligned audio and video simultaneously is difficult due to high dimensionality, disparate data formats (1D audio vs. 3D video), and distinct semantic content patterns.
Why it matters:
Current methods like MM-Diffusion operate in high-dimensional signal space, causing massive computational burdens and slow sampling
Existing attention-based fusion mechanisms often use small windows (e.g., size 8), limiting cross-modal context and leading to poor synchronization
Distinct representations (waveforms vs. pixel grids) make it hard for single models to capture correlations without unified formatting
Concrete Example: Previous work uses random-shift attention with small windows, preventing the model from learning global audio-video alignment. MM-LDM converts audio to images and uses full attention in a compressed latent space, enabling generation of 256x256 videos with synchronized sound.
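To make the window-size limitation concrete, the following is a minimal NumPy sketch comparing how much of the audio sequence each video token can see under a local window versus full attention. It is a simplified stand-in, not MM-Diffusion's random-shift attention and not MM-LDM's actual layers; the function names, the token counts (16 video, 64 audio), and the position rescaling are all assumptions made for illustration.

```python
import numpy as np

def window_mask(n_video, n_audio, window):
    """Boolean mask: video token i may attend only to audio tokens near its
    rescaled position (hypothetical local-window scheme)."""
    v_pos = np.arange(n_video)[:, None] * (n_audio / n_video)
    a_pos = np.arange(n_audio)[None, :]
    return np.abs(a_pos - v_pos) <= window / 2

def full_mask(n_video, n_audio):
    """Boolean mask for full cross-modal attention: every pair is visible."""
    return np.ones((n_video, n_audio), dtype=bool)

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where masked-out pairs get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Fraction of video-audio pairs that can interact under each scheme.
local_coverage = window_mask(16, 64, window=8).mean()
full_coverage = full_mask(16, 64).mean()
```

With these illustrative sizes, the window of 8 exposes only a small fraction of the video-audio pairs, whereas full attention exposes all of them. Full attention becomes affordable once the sequence is short, which is what compressing into a latent space provides.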
Key Novelty
Hierarchical Multi-Modal Autoencoder with Audio-as-Image Representation
Converts 1D audio into 2D 'audio images' (spectrograms) to unify input formats with video frames, allowing a shared diffusion backbone
Establishes two feature spaces: a low-level 'perceptual' space for compression and a high-level 'semantic' space for cross-modal guidance
Decoders use the *semantic* features of one modality to guide the reconstruction of the *perceptual* latents of the other, ensuring content alignment
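The audio-as-image idea can be sketched as follows: a 1D waveform is framed, transformed to a power spectrum, projected onto mel bands, and normalized so that it can be handled like a single-channel image. This is a minimal NumPy sketch, not the authors' preprocessing; the sample rate (16 kHz), FFT size, hop length, number of mel bands, and the min-max normalization are assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def audio_to_image(wave, sr=16000, n_fft=512, hop=160, n_mels=64):
    """Convert a 1D waveform into a 2D log-mel 'audio image' scaled to [0, 1]."""
    n_frames = 1 + (len(wave) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)              # (n_frames, n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # (n_frames, n_fft//2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T   # (n_frames, n_mels)
    log_mel = np.log(mel + 1e-6)
    lo, hi = log_mel.min(), log_mel.max()
    img = (log_mel - lo) / (hi - lo + 1e-12)
    return img.T                                        # (n_mels, n_frames)

# Example: one second of a 440 Hz tone becomes a 64 x 97 image.
t = np.arange(16000) / 16000.0
tone_image = audio_to_image(np.sin(2 * np.pi * 440.0 * t))
```

Once audio is in this 2D form, the same kind of image encoder and diffusion backbone used for video frames can process it, which is the motivation the summary gives for the shared format.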
Architecture
Overview of the MM-LDM framework, showing the Autoencoder compression and the Diffusion process
Evaluation Highlights
Reduces FVD (Frechet Video Distance, lower is better) by 114.6 relative to MM-Diffusion on the AIST++ dataset
Reduces FAD (Frechet Audio Distance, lower is better) by 2.1 relative to MM-Diffusion on AIST++
Breakthrough Assessment
8/10
Significant efficiency gains (10x faster sampling) and a unified architecture for a notoriously difficult multi-modal task. The hierarchical latent design is a sensible way to address the semantic gap between audio and video.
⚙️ Technical Details
Problem Definition
Setting: Joint generation of video signals v and audio signals a
Inputs: Random noise (unconditional) or Partial modality (conditional generation)
Outputs: Synchronized video frames and audio waveform
Pipeline Flow
Data Preprocessing: Audio -> Audio Image, Video -> Frames
Hierarchical Autoencoder (Encoding)
Latent Diffusion (Denoising)
Hierarchical Autoencoder (Decoding)
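The latent diffusion stage can be illustrated with the standard DDPM forward process applied to stacked video and audio latents. This is a minimal sketch under stated assumptions, not the paper's training code: the linear noise schedule, the latent shapes (16 frame latents plus one audio-image latent, each 4 x 32 x 32), and the placeholder denoiser are hypothetical, and the real model would use a learned backbone in place of `zero_denoiser`.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention factors for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, rng):
    """Forward process: z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def eps_loss(denoiser, z0, t, alpha_bar, rng):
    """Noise-prediction objective: mean squared error between predicted and true eps."""
    zt, eps = q_sample(z0, t, alpha_bar, rng)
    return float(np.mean((denoiser(zt, t) - eps) ** 2))

rng = np.random.default_rng(0)
z_video = rng.standard_normal((16, 4, 32, 32))  # hypothetical video frame latents
z_audio = rng.standard_normal((1, 4, 32, 32))   # hypothetical audio-image latent
z0 = np.concatenate([z_video, z_audio], axis=0) # both modalities denoised jointly

alpha_bar = make_alpha_bar()
zero_denoiser = lambda zt, t: np.zeros_like(zt) # placeholder for a learned model
loss = eps_loss(zero_denoiser, z0, 500, alpha_bar, rng)
```

Because both modalities share the same latent layout here, a single denoiser can see them together at every step. The placeholder that always predicts zero yields a loss near 1, the variance of the injected noise, which is the baseline a trained model must beat.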
System Modules
Data Preprocessor (Hierarchical Autoencoder (Encoding))
Code is publicly available at https://github.com/iva-mzsun/MM-LDM. The paper relies on a pretrained HiFiGAN for audio reconstruction and on pretrained KL-VAE weights. Specific hyperparameters (learning rate, batch size) are not detailed in the provided text snippet.
📊 Experiments & Results
Evaluation Setup
Unconditional and Conditional Sounding Video Generation
Benchmarks:
Landscape (Sounding Video Generation)
AIST++ (Dance Video with Music Generation)
AudioSet (Open-domain Sounding Video Generation)
Metrics:
FVD (Frechet Video Distance)
KVD (Kernel Video Distance)
FAD (Frechet Audio Distance)
Statistical methodology: Not explicitly reported in the paper
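FVD and FAD are both Frechet distances between Gaussian fits of embedding features extracted from real and generated samples: d^2 = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The sketch below computes this quantity with NumPy only. It assumes the features have already been extracted (commonly with an I3D network for FVD and a VGGish network for FAD); the extractors are not included, and the random arrays in the example are placeholders rather than real embeddings.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (n_samples, dim) feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_gen, rowvar=False)
    # Tr((S_r S_g)^(1/2)) equals the sum of square roots of the eigenvalues of
    # S_r S_g, which are real and non-negative for positive semi-definite inputs.
    eig = np.linalg.eigvals(s_r @ s_g)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(s_r) + np.trace(s_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((2000, 4))   # placeholder features, not real embeddings
same = frechet_distance(real, real)     # identical distributions
shifted = frechet_distance(real, real + 3.0)  # same covariance, shifted mean
```

Identical feature sets give a distance of essentially zero, while shifting every one of the 4 dimensions by 3 adds 4 x 3^2 = 36 through the mean term alone. Lower values therefore indicate generated samples whose feature statistics are closer to the real data.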
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
AIST++ | FVD | Not reported in the paper | Not reported in the paper | -114.6
Experiment Figures
Qualitative examples of generated sounding videos
Main Takeaways
MM-LDM achieves state-of-the-art performance on AIST++, lowering FVD by 114.6 and FAD by 2.1 compared to MM-Diffusion.
The method is significantly more efficient, sampling about 10x faster because it operates in a compressed latent space rather than in raw signal space.
The hierarchical autoencoder design (separating perceptual compression from semantic alignment) is crucial for bridging the gap between audio and video modalities.
📚 Prerequisite Knowledge
Prerequisites
Latent Diffusion Models (LDM)
Variational Autoencoders (VAE)
Mel Spectrograms
Contrastive Learning
Key Terms
SVG: Sounding Video Generation—the task of simultaneously generating video frames and their corresponding audio track
FVD: Frechet Video Distance—a metric for evaluating the quality and realism of generated videos
FAD: Frechet Audio Distance—a metric for evaluating the quality and realism of generated audio
Audio Image: A 2D representation of audio (e.g., a Mel Spectrogram) treated as an image channel to unify processing with video frames
DiT: Diffusion Transformer—a diffusion model backbone that uses Transformer architecture instead of the standard U-Net
Perceptual Latent Space: A compressed feature space that preserves low-level details perceptually equivalent to raw signals
Semantic Feature Space: A high-level feature space derived from perceptual latents, optimized to align audio and video concepts
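Since contrastive learning is listed as a prerequisite and the semantic feature space is described as optimized to align audio and video concepts, a symmetric InfoNCE loss is one common way such alignment is trained. The sketch below is an illustration of that general technique, not the paper's exact objective, which is not given in this summary; the function names, the temperature of 0.07, and the feature shapes are assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(video_sem, audio_sem, temperature=0.07):
    """Contrastive loss over a batch: row i of each input is a matching pair."""
    v = video_sem / np.linalg.norm(video_sem, axis=1, keepdims=True)
    a = audio_sem / np.linalg.norm(audio_sem, axis=1, keepdims=True)
    logits = v @ a.T / temperature            # (batch, batch) cosine similarities
    diag = np.arange(len(v))
    loss_v2a = -log_softmax(logits, axis=1)[diag, diag].mean()  # video -> audio
    loss_a2v = -log_softmax(logits, axis=0)[diag, diag].mean()  # audio -> video
    return float(0.5 * (loss_v2a + loss_a2v))

rng = np.random.default_rng(0)
video_features = rng.standard_normal((8, 16))   # hypothetical semantic features
aligned = symmetric_info_nce(video_features, video_features)
mismatched = symmetric_info_nce(video_features, np.roll(video_features, 1, axis=0))
```

The loss is low when each video feature is most similar to its own audio partner and high when the pairing is shuffled, which is the behavior a shared semantic space needs in order to guide cross-modal decoding.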