MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

📝 Paper Summary

Sounding Video Generation (SVG)
Multi-modal Generation

MM-LDM unifies audio and video generation into a single latent diffusion framework using a hierarchical autoencoder that compresses signals into low-level perceptual latents and high-level shared semantic features.

Core Problem

Generating aligned audio and video simultaneously is difficult due to high dimensionality, disparate data formats (1D audio vs. 3D video), and distinct semantic content patterns.

Why it matters:

Current methods like MM-Diffusion operate in high-dimensional signal space, causing massive computational burdens and slow sampling
Existing attention-based fusion mechanisms often use small windows (e.g., size 8), limiting cross-modal context and leading to poor synchronization
Distinct representations (waveforms vs. pixel grids) make it hard for single models to capture correlations without unified formatting

Concrete Example: Previous work uses random-shift attention with small windows, preventing the model from learning global audio-video alignment. MM-LDM converts audio to images and uses full attention in a compressed latent space, enabling generation of 256x256 videos with synchronized sound.

Key Novelty

Hierarchical Multi-Modal Autoencoder with Audio-as-Image Representation

Converts 1D audio into 2D 'audio images' (spectrograms) to unify input formats with video frames, allowing a shared diffusion backbone
Establishes two feature spaces: a low-level 'perceptual' space for compression and a high-level 'semantic' space for cross-modal guidance
Decoders use the *semantic* features of one modality to guide the reconstruction of the *perceptual* latents of the other, ensuring content alignment

Architecture

Figure: Overview of the MM-LDM framework, showing the autoencoder compression and the diffusion process

Evaluation Highlights

Lowers FVD (Frechet Video Distance; lower is better) by 114.6 relative to MM-Diffusion on the AIST++ dataset
Samples 10x faster than signal-space baselines by operating in a compressed latent space
Lowers FAD (Frechet Audio Distance; lower is better) by 2.1 points on AIST++

Breakthrough Assessment

8/10

Significant efficiency gains (10x sampling speedup) and a unified architecture for a notoriously difficult multi-modal task. The hierarchical latent design sensibly addresses the semantic gap between audio and video.

⚙️ Technical Details

Problem Definition

Setting: Joint generation of video signals v and audio signals a

Inputs: Random noise (unconditional generation) or one given modality (conditional generation)

Outputs: Synchronized video frames and an audio waveform

Pipeline Flow

Data Preprocessing: Audio -> Audio Image, Video -> Frames
Hierarchical Autoencoder (Encoding)
Latent Diffusion (Denoising)
Hierarchical Autoencoder (Decoding)
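
The first pipeline step (Audio -> Audio Image) can be illustrated with a small sketch. It assumes the audio image is a log-magnitude spectrogram computed with a naive short-time DFT. The summary mentions Mel spectrograms, so the missing Mel filterbank, the function name `audio_to_image`, and the frame and hop sizes are simplifications chosen for illustration, not the authors' preprocessing.

```python
import cmath
import math


def audio_to_image(wave, frame_len=16, hop=8):
    """Turn a 1D waveform into a 2D log-magnitude spectrogram.

    Rows are frequency bins, columns are time frames.
    All sizes are illustrative assumptions, not values from the paper.
    """
    starts = range(0, len(wave) - frame_len + 1, hop)
    n_bins = frame_len // 2 + 1  # non-negative frequencies of a real signal
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (frame_len - 1)))
              for n in range(frame_len)]
    image = [[0.0] * len(starts) for _ in range(n_bins)]
    for t, start in enumerate(starts):
        frame = [wave[start + n] * window[n] for n in range(frame_len)]
        for k in range(n_bins):
            # Naive DFT coefficient for frequency bin k.
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            image[k][t] = math.log(abs(coeff) + 1e-6)
    return image


# A pure tone that completes 4 cycles per 16-sample frame.
wave = [math.sin(2.0 * math.pi * 4 * n / 16) for n in range(64)]
image = audio_to_image(wave)
```

The result is a 2D grid that can be handled like a single-channel image, which is the property that lets audio share an image-style encoder and backbone with video frames.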

System Modules

Data Preprocessor (Hierarchical Autoencoder (Encoding))

Unifies input formats

Model or implementation: Signal Transformation

Modal-Specific Encoders (Hierarchical Autoencoder (Encoding))

Compresses high-dimensional signals into low-dimensional perceptual latents

Model or implementation: CNN-based VAE Encoders

Semantic Projectors (Hierarchical Autoencoder (Encoding))

Extracts high-level semantic features to bridge the information gap between modalities

Model or implementation: Projection Heads

MM-LDM Backbone

Jointly denoises audio and video latents

Model or implementation: DiT (Diffusion Transformer)
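
A minimal sketch of what joint denoising over both modalities could look like at training time, assuming the standard DDPM forward process and an epsilon-prediction MSE loss on concatenated latents. The latent sizes, the `alpha_bar` value, and the flattening of both latents into one list are hypothetical, and the DiT is replaced by trivial stand-in predictors.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps


def eps_mse(eps_pred, eps_true):
    """Mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(eps_pred, eps_true)) / len(eps_true)


rng = random.Random(0)
audio_latent = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # hypothetical size
video_latent = [rng.uniform(-1.0, 1.0) for _ in range(24)]  # hypothetical size

# One joint sequence, so a single backbone sees both modalities at once.
joint_latent = audio_latent + video_latent

alpha_bar = 0.5  # hypothetical noise level for one timestep
noisy_joint, eps = add_noise(joint_latent, alpha_bar, rng)

# Stand-ins for the backbone: a predictor that outputs zeros and an oracle.
loss_zero = eps_mse([0.0] * len(eps), eps)
loss_oracle = eps_mse(eps, eps)

# With the true noise, the clean joint latent is recovered exactly.
recovered = [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
             for x, e in zip(noisy_joint, eps)]
```

Because the noise is drawn over the whole joint sequence, a single loss term covers both the audio and the video latents.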

Signal Decoders

Reconstructs raw signals using perceptual latents and cross-modal semantic guidance

Model or implementation: UNet-based Decoders

Novel Architectural Elements

Hierarchical feature spaces: separate Perceptual Latent Space (for compression) and Semantic Feature Space (for alignment)
Cross-modal decoding: The signal decoder of one modality is explicitly conditioned on the semantic features of the *other* modality
Unified 'Audio Image' representation allowing a shared Diffusion Transformer backbone for both audio and video generation

Modeling

Base Model: DiT (Diffusion Transformer) as the diffusion backbone; KL-VAE for autoencoding

Training Method: Adversarial Training + Contrastive Learning + Diffusion Training

Objective Functions:

Purpose: Ensure semantic alignment between audio and video features.
Formally: Contrastive Loss (InfoNCE-style) on semantic features s_a and s_v.

Purpose: Enforce high-level semantic validity.
Formally: Classification Cross-Entropy Loss on semantic features.

Purpose: Improve realism and consistency of reconstructed signals.
Formally: Audio-Video Adversarial Loss (GAN loss) on the autoencoder outputs.

Purpose: Train the diffusion model to remove noise.
Formally: MSE Loss (epsilon-prediction) on concatenated latents.
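
The contrastive objective on s_a and s_v can be sketched as follows. This is a generic symmetric InfoNCE over a batch of paired features, assuming cosine similarity and a temperature of 0.07. Both choices, the symmetric averaging, and the function name `info_nce` are assumptions for illustration, since the summary only states that the loss is InfoNCE-style.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(u):
    norm = math.sqrt(_dot(u, u))
    return [a / norm for a in u]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i should pick column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(audio_feats, video_feats, temperature=0.07):
    """Symmetric InfoNCE: pair i of audio and video is the positive."""
    a = [_normalize(x) for x in audio_feats]
    v = [_normalize(x) for x in video_feats]
    logits = [[_dot(ai, vj) / temperature for vj in v] for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits)
                  + _cross_entropy_diagonal(transposed))


s_a = [[1.0, 0.0], [0.0, 1.0]]          # toy audio semantic features
s_v_aligned = [[1.0, 0.0], [0.0, 1.0]]  # matching video features
s_v_swapped = [[0.0, 1.0], [1.0, 0.0]]  # pairs deliberately mismatched

loss_aligned = info_nce(s_a, s_v_aligned)
loss_swapped = info_nce(s_a, s_v_swapped)
```

The loss is small when each audio feature is closest to its own video feature and large when the pairs are mismatched, which is the pressure that pulls the two semantic spaces together.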

Compute: Not reported in the paper

Comparison to Prior Work

vs. MM-Diffusion: Uses latent space (vs. signal space) for 10x speedup; uses full attention (vs. small window) for better alignment
vs. AudioLDM/VideoLDM: Jointly generates both modalities using a shared backbone and cross-modal semantic guidance

Limitations

Relies on converting audio to images, which may introduce artifacts during the HiFiGAN inversion process
Requires complex multi-objective training (GAN, Contrastive, Diffusion losses)
Experimental details (hyperparameters) are not fully visible in the provided truncated text

Reproducibility

Code: https://github.com/iva-mzsun/MM-LDM

The paper relies on a pretrained HiFiGAN for audio reconstruction and on KL-VAE weights. Specific hyperparameters (learning rate, batch size) are not detailed in the provided text snippet.

📊 Experiments & Results

Evaluation Setup

Unconditional and Conditional Sounding Video Generation

Benchmarks:

Landscape (Sounding Video Generation)
AIST++ (Dance Video with Music Generation)
AudioSet (Open-domain Sounding Video Generation)

Metrics:

FVD (Frechet Video Distance)
KVD (Kernel Video Distance)
FAD (Frechet Audio Distance)

Statistical methodology: Not explicitly reported in the paper
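
FVD and FAD both compare the statistics of features extracted from real and generated samples using a Frechet distance between two Gaussians, so lower values are better. The sketch below assumes diagonal covariances, in which case the distance reduces to a sum of squared differences of per-dimension means and standard deviations. Real FVD/FAD implementations use full covariance matrices and features from pretrained networks; the function name and the toy feature vectors here are hypothetical.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(real_feats, fake_feats):
    """Frechet distance between Gaussians fitted per dimension.

    Simplifying assumption: diagonal covariance, so the trace term
    becomes sum((sigma_real - sigma_fake) ** 2).
    """
    distance = 0.0
    # zip(*rows) iterates over feature dimensions (columns).
    for real_col, fake_col in zip(zip(*real_feats), zip(*fake_feats)):
        mu_r, mu_f = fmean(real_col), fmean(fake_col)
        sd_r = math.sqrt(pvariance(real_col))
        sd_f = math.sqrt(pvariance(fake_col))
        distance += (mu_r - mu_f) ** 2 + (sd_r - sd_f) ** 2
    return distance


real = [[0.0, 0.0], [2.0, 2.0]]
shifted = [[3.0, 0.0], [5.0, 2.0]]  # same spread, first dimension shifted by 3

d_same = frechet_distance_diag(real, real)
d_shifted = frechet_distance_diag(real, shifted)
```

Identical feature sets give a distance of zero, and shifting one dimension's mean by 3 adds 3 squared to the distance.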

Key Results

Benchmark	Metric	Baseline (MM-Diffusion)	This Paper	Δ (lower is better)
AIST++	FVD	Not reported in the paper	Not reported in the paper	-114.6

Experiment Figures

Figure: Qualitative examples of generated sounding videos

Main Takeaways

MM-LDM achieves state-of-the-art performance on AIST++, lowering FVD by 114.6 points and FAD by 2.1 points compared to MM-Diffusion.
The method is significantly more efficient, offering 10x faster sampling speed by operating in a compressed latent space rather than raw signal space.
The hierarchical autoencoder design (separating perceptual compression from semantic alignment) is crucial for bridging the gap between audio and video modalities.

📚 Prerequisite Knowledge

Prerequisites

Latent Diffusion Models (LDM)
Variational Autoencoders (VAE)
Mel Spectrograms
Contrastive Learning

Key Terms

SVG: Sounding Video Generation—the task of simultaneously generating video frames and their corresponding audio track

FVD: Frechet Video Distance—a metric for evaluating the quality and realism of generated videos

FAD: Frechet Audio Distance—a metric for evaluating the quality and realism of generated audio

Audio Image: A 2D representation of audio (e.g., a Mel Spectrogram) treated as an image channel to unify processing with video frames

DiT: Diffusion Transformer—a diffusion model backbone that uses Transformer architecture instead of the standard U-Net

Perceptual Latent Space: A compressed feature space that preserves low-level details perceptually equivalent to raw signals

Semantic Feature Space: A high-level feature space derived from perceptual latents, optimized to align audio and video concepts