
Multi-modal Latent Diffusion

Mustapha Bounoua, Giulio Franzese, Pietro Michiardi
Renault Software Factory, EURECOM, France
arXiv (2023)
MM Benchmark

📝 Paper Summary

Multi-modal generative modeling · Latent diffusion models
MLD replaces complex multi-modal VAE posteriors with a score-based diffusion model operating on the concatenated latent space of independently trained deterministic autoencoders.
Core Problem
Existing multi-modal VAEs suffer from a coherence–quality tradeoff: models with high generation quality lack consistency across modalities, while coherent models produce low-quality samples.
Why it matters:
  • Current approaches (Product of Experts, Mixture of Experts) suffer from latent variable collapse or information loss due to mixture sub-sampling
  • Applications like data augmentation and missing modality imputation require both high fidelity and strict semantic alignment between modalities (e.g., image and sound)
  • Reducing encoder/decoder flexibility to improve coherence hurts generative quality, creating a fundamental bottleneck in VAE-based designs
Concrete Example: In the MNIST-SVHN dataset, VAE-based models often fail to generate the correct digit in the SVHN modality given an MNIST digit (poor coherence), or generate blurry, unrecognizable digits to maintain coherence (poor quality). MLD generates sharp, correct SVHN digits from MNIST inputs.
Key Novelty
Multi-modal Latent Diffusion (MLD)
  • Decouples modality encoding from joint modeling: uses independent, deterministic autoencoders for each modality to avoid information loss and gradient conflicts
  • Concatenates individual latent representations into a single joint latent space, then learns the joint distribution using a score-based diffusion model
  • Introduces a 'multi-time' training scheme where the diffusion model learns to handle arbitrary subsets of missing modalities via randomized masking during training
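The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`concat_latents`, `multi_time_noising`), the 50% masking probability, and the simple additive-noise perturbation are all hypothetical choices made for brevity. The summary does not specify the paper's actual noise schedule, masking distribution, or network. The sketch only shows the idea that each modality carries its own diffusion time: modalities selected for generation are noised at time t, while conditioning modalities stay clean at time 0.

```python
import numpy as np


def concat_latents(latents):
    """Concatenate per-modality latents, each of shape (batch, d_i), into one joint latent."""
    return np.concatenate(latents, axis=1)


def multi_time_noising(latents, rng, sigma_max=1.0):
    """Noise a random non-empty subset of modalities and keep the rest clean.

    Assumption: a simple additive perturbation z + sigma_max * t * eps stands in
    for whatever diffusion process the paper actually uses.
    Returns (noised joint latent, per-modality times, noise targets, mask).
    """
    batch = latents[0].shape[0]
    n_mod = len(latents)

    # Hypothetical masking rule: each modality is diffused with probability 0.5.
    # True = modality to generate (noised), False = conditioning modality (clean).
    mask = rng.random(n_mod) < 0.5
    if not mask.any():
        mask[rng.integers(n_mod)] = True  # always diffuse at least one modality

    # One diffusion time per sample, shared by all diffused modalities.
    t = rng.uniform(0.0, 1.0, size=(batch, 1))

    noised, times, targets = [], [], []
    for z, diffuse in zip(latents, mask):
        if diffuse:
            eps = rng.standard_normal(z.shape)
            noised.append(z + sigma_max * t * eps)
            times.append(t)
            targets.append(eps)
        else:
            noised.append(z)                   # conditioning latent is left untouched
            times.append(np.zeros_like(t))     # its time entry is fixed to 0
            targets.append(np.zeros_like(z))   # nothing to predict for it

    return concat_latents(noised), np.concatenate(times, axis=1), concat_latents(targets), mask
```

A score network trained on such inputs would receive the noised joint latent together with the per-modality time vector, so at sampling time any subset of modalities can be held fixed (time 0) while the others are denoised.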
Evaluation Highlights
  • Achieves 85.22% joint coherence on MNIST-SVHN, outperforming the best baseline (MVTCAE) by more than 36 percentage points
  • Reduces FID (lower is better) on MNIST-SVHN Joint(S) generation to 57.2, compared to 69.48 for the next best baseline
  • On the 5-modality POLYMNIST dataset, achieves near-perfect coherence (>98%) across almost all joint generation tasks, significantly surpassing VAE-based competitors
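The summary does not define how coherence is computed. A common convention in the multi-modal VAE literature, assumed here, is to classify each generated modality with a pre-trained classifier and count a jointly generated sample as coherent when all modalities receive the same label. The helper below is a hypothetical sketch of that convention, taking the classifier predictions as input rather than running any classifier.

```python
import numpy as np


def joint_coherence(predicted_labels):
    """Fraction of generated samples whose modalities all receive the same label.

    predicted_labels: integer array-like of shape (n_samples, n_modalities),
    one (assumed pre-trained) classifier prediction per generated modality.
    """
    preds = np.asarray(predicted_labels)
    # Compare every modality's label against the first modality's label.
    agree = (preds == preds[:, :1]).all(axis=1)
    return float(agree.mean())
```

For example, with predictions `[[3, 3], [1, 2], [7, 7], [0, 0]]` for four two-modality samples, three of the four rows agree, so the function returns 0.75.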
Breakthrough Assessment
8/10
Significantly outperforms established VAE baselines on the coherence-quality tradeoff. The architectural shift to diffusion on concatenated deterministic latents is a strong, effective simplification.