Multi-modal Latent Diffusion

📝 Paper Summary

Multi-modal generative modeling, Latent diffusion models

MLD replaces complex multi-modal VAE posteriors with a score-based diffusion model operating on the concatenated latent space of independently trained deterministic autoencoders.

Core Problem

Existing multi-modal VAEs suffer from a coherence–quality tradeoff: models with good generation quality lack consistency across modalities, while coherent models produce poor-quality samples.

Why it matters:

Current approaches (Product of Experts, Mixture of Experts) suffer from latent variable collapse or information loss due to mixture sub-sampling
Applications like data augmentation and missing modality imputation require both high fidelity and strict semantic alignment between modalities (e.g., image and sound)
Reducing encoder/decoder flexibility to improve coherence hurts generative quality, creating a fundamental bottleneck in VAE-based designs

Concrete Example: In the MNIST-SVHN dataset, VAE-based models often fail to generate the correct digit in the SVHN modality given an MNIST digit (poor coherence), or generate blurry, unrecognizable digits to maintain coherence (poor quality). MLD generates sharp, correct SVHN digits from MNIST inputs.

Key Novelty

Multi-modal Latent Diffusion (MLD)

Decouples modality encoding from joint modeling: uses independent, deterministic autoencoders for each modality to avoid information loss and gradient conflicts
Concatenates individual latent representations into a single joint latent space, then learns the joint distribution using a score-based diffusion model
Introduces a 'multi-time' training scheme where the diffusion model learns to handle arbitrary subsets of missing modalities via randomized masking during training

Evaluation Highlights

Achieves 85.22% joint coherence on MNIST-SVHN, outperforming the best baseline (MVTCAE) by more than 36 percentage points
Reduces FID (lower is better) on MNIST-SVHN Joint(S) generation to 57.2, compared to 69.48 for the next-best baseline
On the 5-modality POLYMNIST dataset, achieves near-perfect coherence (>98%) across almost all joint generation tasks, significantly surpassing VAE-based competitors

Breakthrough Assessment

8/10

Significantly outperforms established VAE baselines on the coherence-quality tradeoff. The architectural shift to diffusion on concatenated deterministic latents is a strong, effective simplification.

⚙️ Technical Details

Problem Definition

Setting: Generative modeling of multi-modal data X = {X1, ..., XM} sampled from pD, aiming for both joint generation and conditional generation of missing modalities given subsets.

Inputs: A set of modalities (e.g., images, text, audio), potentially with some modalities missing.

Outputs: Generated samples for the missing modalities or new joint samples for all modalities.

Pipeline Flow

Uni-modal Encoding (Independent Autoencoders)
Latent Concatenation
Masked Diffusion Process (Score Network)
Uni-modal Decoding
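
The data flow through these four stages can be sketched in a few lines. This is only an illustration of the bookkeeping (encode, concatenate, split, decode): the scaling "autoencoders", the modality names, and the helper functions are hypothetical stand-ins for trained networks, and the diffusion stage is left as a comment.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]

def make_scaling_autoencoder(scale: float) -> Tuple[Callable[[Vector], Vector], Callable[[Vector], Vector]]:
    """Hypothetical toy autoencoder: a deterministic encoder and its exact inverse."""
    def encode(x: Vector) -> Vector:
        return [scale * v for v in x]
    def decode(z: Vector) -> Vector:
        return [v / scale for v in z]
    return encode, decode

# One independent (encoder, decoder) pair per modality (names are assumptions).
AUTOENCODERS = {
    "image": make_scaling_autoencoder(0.5),
    "sound": make_scaling_autoencoder(2.0),
}

def encode_and_concatenate(sample: Dict[str, Vector]):
    """Stages 1-2: encode each modality on its own, then concatenate the latents."""
    joint, slices, start = [], {}, 0
    for name, (encode, _) in AUTOENCODERS.items():
        z = encode(sample[name])
        slices[name] = (start, start + len(z))  # remember where each block lives
        joint.extend(z)
        start += len(z)
    return joint, slices

def split_and_decode(joint: Vector, slices: Dict[str, Tuple[int, int]]) -> Dict[str, Vector]:
    """Stage 4: cut the joint latent back into blocks and decode each one."""
    return {name: AUTOENCODERS[name][1](joint[a:b]) for name, (a, b) in slices.items()}

sample = {"image": [1.0, 2.0, 3.0], "sound": [4.0, 5.0]}
joint_latent, latent_slices = encode_and_concatenate(sample)
# Stage 3 (the masked diffusion model) would operate on joint_latent here.
reconstruction = split_and_decode(joint_latent, latent_slices)
```

Because the joint representation is a plain concatenation, each modality keeps its own latent block, and the per-modality slices are all that is needed to route generated latents back to the right decoder.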

System Modules

Uni-modal Encoders (Input Processing)

Project each modality into a modality-specific latent space deterministically

Model or implementation: Deterministic Autoencoders (modality-specific architectures)

Latent Concatenator (Input Processing)

Combine individual latents into a joint latent vector

Model or implementation: Concatenation operation

Score Network

Estimate the score (gradient of log-density) to reverse the diffusion process

Model or implementation: Stacked MLP with skip connections

Uni-modal Decoders

Map generated latent vectors back to data space

Model or implementation: Deterministic Decoders (matching the pre-trained encoders)

Novel Architectural Elements

Concatenation of deterministic latent spaces as the primary joint representation (vs. product/mixture of experts)
Multi-time diffusion training: a single score network conditioned on a time vector that indicates which modalities are 'observed' (frozen) and which are being generated
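
One way to picture the time vector is as one diffusion time per modality, with observed modalities pinned to time 0 so that their latents stay clean. The sketch below makes that reading concrete; the exact encoding of the time vector, the constant-rate noise schedule, and all names are assumptions for illustration, not details taken from the paper.

```python
import math
import random

MODALITIES = ["image", "trajectory", "sound"]  # assumed modality order

def multi_time_vector(t, observed):
    """One diffusion time per modality: 0 for observed (frozen) ones, t otherwise."""
    return [0.0 if name in observed else t for name in MODALITIES]

def perturb(latents, times, rng):
    """Noise each modality's latent block according to its own time entry."""
    noisy = {}
    for name, t in zip(MODALITIES, times):
        alpha = math.exp(-0.5 * t)           # signal scale (assumed schedule)
        sigma = math.sqrt(1.0 - alpha ** 2)  # noise scale, exactly 0 when t == 0
        noisy[name] = [alpha * v + sigma * rng.gauss(0.0, 1.0) for v in latents[name]]
    return noisy

rng = random.Random(0)
latents = {"image": [0.3, -1.2], "trajectory": [0.8], "sound": [1.5, 0.1, -0.4]}
times = multi_time_vector(t=0.7, observed={"image"})
noisy = perturb(latents, times, rng)
```

Here the image block passes through unchanged while the trajectory and sound blocks are noised, so a score network that receives `noisy` together with `times` can tell which blocks are conditioning information and which are being generated.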

Modeling

Base Model: Score network is a Stacked MLP with skip connections

Training Method: Two-stage training: (1) Independent deterministic autoencoders, (2) Score-based diffusion on latents

Objective Functions:

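Purpose: Train autoencoders to reconstruct inputs.

Formally: L = Σ Li, where Li = ∫ pD(xi) ||xi - di(ei(xi))||^2 dxi, with ei and di the encoder and decoder of modality i

Purpose: Train the score network to match the score of the noised latent distribution.

Formally: Score matching objective minimizing E[||s(Rt, t) - ∇log q(Rt, t)||^2], where Rt is the joint latent at diffusion time t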
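
Both objectives can be written down for a single sample as a rough sketch. The marginal score ∇log q(Rt, t) is not available in closed form in general, so the sketch uses the standard denoising score matching substitute, the score of the Gaussian perturbation kernel q(Rt | R0). Whether the paper trains in exactly this form, as well as the noise schedule and every function name below, are assumptions.

```python
import math
import random

def reconstruction_loss(x, encode, decode):
    """Squared reconstruction error ||x - d(e(x))||^2 for one sample."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def noise_schedule(t):
    """Assumed schedule: signal scale alpha(t) and noise scale sigma(t)."""
    alpha = math.exp(-0.5 * t)
    return alpha, math.sqrt(1.0 - alpha ** 2)

def denoising_score_target(r0, rt, t):
    """Score of the Gaussian kernel q(rt | r0) = N(alpha * r0, sigma^2 I)."""
    alpha, sigma = noise_schedule(t)
    return [-(b - alpha * a) / sigma ** 2 for a, b in zip(r0, rt)]

def score_matching_loss(score_fn, r0, t, rng):
    """Single-sample estimate of ||s(rt, t) - target score||^2."""
    alpha, sigma = noise_schedule(t)
    rt = [alpha * v + sigma * rng.gauss(0.0, 1.0) for v in r0]
    target = denoising_score_target(r0, rt, t)
    prediction = score_fn(rt, t)
    return sum((p - q) ** 2 for p, q in zip(prediction, target))

r0 = [0.5, -1.0, 2.0]
oracle = lambda rt, t: denoising_score_target(r0, rt, t)  # knows r0, so it is exact
zero_model = lambda rt, t: [0.0] * len(rt)                # untrained placeholder
loss_oracle = score_matching_loss(oracle, r0, 0.5, random.Random(1))
loss_zero = score_matching_loss(zero_model, r0, 0.5, random.Random(1))
identity_loss = reconstruction_loss([1.0, 2.0], lambda x: x, lambda z: z)
```

The oracle attains zero loss only because it is given the clean latent; a trained score network sees just the noisy latent and the time, and minimizing the expected loss drives it toward the marginal score.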

Training Data:

MNIST-SVHN (paired digits)
MHD (Multi-modal Handwritten Digits: Image, Trajectory, Sound)
POLYMNIST (5 modalities of MNIST with different backgrounds)
Caltech Birds (CUB) (Image, Text/Caption)

Key Hyperparameters:

diffusion_steps_inference: N (sampling step size is T/N)
mask_distribution_prob_empty: d (probability of unconditional generation during training)
mask_distribution_prob_subset: (1-d)/(2^M - 1) (probability assigned to each of the 2^M - 1 non-empty modality subsets)
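
Read literally, the two mask hyperparameters define a distribution over modality subsets: the empty subset gets probability d and the remaining 1 - d is spread evenly over the 2^M - 1 non-empty subsets. The sketch below builds and samples that distribution. Treating the subset as the set of observed (conditioning) modalities is an interpretation, and the function names are hypothetical.

```python
import itertools
import random

def subset_probabilities(num_modalities, d):
    """Probability of each modality subset under the stated mask distribution."""
    indices = range(num_modalities)
    probs = {frozenset(): d}  # empty subset: unconditional generation
    per_subset = (1.0 - d) / (2 ** num_modalities - 1)
    for size in range(1, num_modalities + 1):
        for combo in itertools.combinations(indices, size):
            probs[frozenset(combo)] = per_subset  # every non-empty subset
    return probs

def sample_subset(probs, rng):
    """Draw one subset according to its probability."""
    subsets = list(probs)
    weights = [probs[s] for s in subsets]
    return rng.choices(subsets, weights=weights, k=1)[0]

probs = subset_probabilities(num_modalities=3, d=0.5)
rng = random.Random(0)
draws = [sample_subset(probs, rng) for _ in range(5)]
```

With M = 3 there are eight subsets in total, and the probabilities sum to one by construction: d + (2^M - 1) · (1 - d)/(2^M - 1) = 1.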

Compute: Not reported in the paper

Comparison to Prior Work

vs. MVAE/MMVAE/MOPOE: MLD uses deterministic encoders + diffusion rather than stochastic encoders + variational bounds, avoiding the information bottleneck and coherence-quality tradeoff.
vs. Tang et al. (2023) [not cited in paper]: Tang et al. compose modality-specific diffusion models via cross-attention; MLD learns a joint distribution over a concatenated latent space.
vs. MMVAE+: MLD concatenates latents directly rather than enforcing separated shared/private spaces.

Limitations

Inference speed is slower than VAEs due to the iterative diffusion sampling process (solving an SDE)
Requires training separate autoencoders for each modality first (two-stage)
Performance on CUB images limited by the capacity of the simple autoencoder used (though scalable)
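
The inference cost in the first limitation comes from integrating the reverse-time SDE step by step, with one score evaluation per step. A minimal Euler-Maruyama sketch on a one-dimensional toy problem shows the loop. To keep it self-contained, the data distribution is a Gaussian whose noised score is known in closed form; the constant noise rate, the horizon, and all names are assumptions, and a real model would call a trained score network instead.

```python
import math
import random

BETA = 1.0          # constant noise rate (assumed)
T = 10.0            # diffusion horizon (assumed)
MU, S0 = 2.0, 0.5   # toy data distribution N(MU, S0^2)

def score(x, t):
    """Exact score of the noised marginal for the Gaussian toy data."""
    alpha = math.exp(-0.5 * BETA * t)
    var = alpha ** 2 * S0 ** 2 + (1.0 - alpha ** 2)
    return -(x - alpha * MU) / var

def sample(num_steps, rng):
    """Integrate the reverse SDE from t = T down to t = 0 with step T / num_steps."""
    dt = T / num_steps
    x = rng.gauss(0.0, 1.0)  # start from the standard normal prior
    for k in range(num_steps):
        t = T - k * dt
        drift = -0.5 * BETA * x - BETA * score(x, t)  # f(x, t) - g^2 * score
        x = x - drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [sample(200, rng) for _ in range(500)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```

Each sample here costs 200 score evaluations, whereas a VAE decoder produces a sample in a single forward pass; reducing the number of steps trades sampling time against discretization error.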

Reproducibility

No code URL is provided in the paper. The method relies on standard autoencoder architectures and score matching, but the score network is described only as a 'simple stacked MLP', without detailed architectural hyperparameters.

📊 Experiments & Results

Evaluation Setup

Joint and conditional generation across multiple modalities.

Benchmarks:

MNIST-SVHN: bi-modal generation (simple image + complex image)
MHD (Multi-modal Handwritten Digits): tri-modal generation (image, trajectory, sound)
POLYMNIST: 5-modal generation (images with different backgrounds)
CUB (Caltech Birds): image-text generation

Metrics:

Coherence (%) (using pre-trained classifiers)
FID (Fréchet Inception Distance)
FAD (Fréchet Audio Distance)
FMD (Fréchet Modality Distance)
CLIP-Score

Statistical methodology: Results averaged over 5 seeds. Standard deviations reported in appendix.
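
The coherence metric in the list above can be sketched as label agreement among classifier predictions. The summary only states that pre-trained classifiers are used, so the exact protocol below (all modalities must agree for joint coherence; the generated modality must match the source label for conditional coherence) is an assumption, as are the function names and the example labels.

```python
def joint_coherence(predicted_labels):
    """Percentage of joint samples whose modalities are all classified alike.

    predicted_labels: one tuple per sample, holding one predicted label per modality.
    """
    agree = sum(1 for labels in predicted_labels if len(set(labels)) == 1)
    return 100.0 * agree / len(predicted_labels)

def conditional_coherence(source_labels, generated_predictions):
    """Percentage of generated samples classified as the conditioning label."""
    hits = sum(1 for s, g in zip(source_labels, generated_predictions) if s == g)
    return 100.0 * hits / len(source_labels)

# Made-up classifier outputs for four generated (MNIST, SVHN) pairs.
joint = joint_coherence([(3, 3), (1, 7), (5, 5), (2, 2)])
# Made-up labels: MNIST digits used as input vs. predictions on generated SVHN.
conditional = conditional_coherence([3, 1, 5, 2], [3, 1, 4, 2])
```

In both toy cases three of four samples are consistent, so each function returns 75.0.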

Key Results

MNIST-SVHN results demonstrate MLD's superiority in both coherence and quality, particularly for the challenging SVHN modality.
MHD results show MLD handles heterogeneous modalities (Sound vs Image) better than baselines.
POLYMNIST results confirm scalability to >2 modalities.

Benchmark	Metric	Baseline	This Paper	Δ
MNIST-SVHN	Joint Coherence	48.78	85.22	+36.44
MNIST-SVHN	Joint(S) FID	69.48	57.2	-12.28
MNIST-SVHN	M→S Conditional Coherence	49.78	79.13	+29.35
MHD	Joint Coherence	48.84	98.34	+49.50
MHD	Sound FAD (Joint)	13.65	2.07	-11.58
POLYMNIST	Coherence (average)	~90	~99	~+9
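
The Δ column is simply "This Paper" minus "Baseline", so positive values are improvements for coherence and negative values are improvements for FID and FAD. A short check of the exactly reported rows (the approximate POLYMNIST row is left out):

```python
# (benchmark, metric, baseline, this_paper) copied from the results table.
rows = [
    ("MNIST-SVHN", "Joint Coherence", 48.78, 85.22),
    ("MNIST-SVHN", "Joint(S) FID", 69.48, 57.2),
    ("MNIST-SVHN", "M->S Conditional Coherence", 49.78, 79.13),
    ("MHD", "Joint Coherence", 48.84, 98.34),
    ("MHD", "Sound FAD (Joint)", 13.65, 2.07),
]

# Round to two decimals to match the precision used in the table.
deltas = [round(ours - baseline, 2) for _, _, baseline, ours in rows]

for (benchmark, metric, _, _), delta in zip(rows, deltas):
    print(f"{benchmark:12s} {metric:28s} {delta:+.2f}")
```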

Experiment Figures

Qualitative comparison on MNIST-SVHN for conditional generation (MNIST → SVHN and SVHN → MNIST).

Main Takeaways

MLD consistently breaks the coherence-quality tradeoff observed in VAEs, achieving state-of-the-art results in both metrics simultaneously.
The 'multi-time' training method effectively enables a single network to handle any combination of conditional generation tasks without retraining.
Independent training of autoencoders prevents 'modality collapse' (where weak modalities are ignored), a common issue in end-to-end trained Multi-modal VAEs.
The method generalizes well to heterogeneous data types (Audio, Image, Trajectory, Text), outperforming baselines that struggle with diverse modalities.

📚 Prerequisite Knowledge

Prerequisites

Variational Autoencoders (VAEs) and the ELBO objective
Score-based Generative Models (SDE formulation)
Diffusion Models (Forward and Reverse processes)
Product of Experts / Mixture of Experts in multi-modal learning

Key Terms

MLD: Multi-modal Latent Diffusion—the authors' proposed method using deterministic autoencoders and latent diffusion

coherence: The semantic consistency between generated modalities (e.g., if an image shows a '3', the generated audio should say 'three')

MOPOE: Mixture of Product of Experts—a VAE-based baseline combining mixture and product aggregations

FID: Fréchet Inception Distance—a metric for assessing the quality of generated images by comparing feature distributions

FAD: Fréchet Audio Distance—similar to FID but for evaluating audio quality

FMD: Fréchet Modality Distance—generalization of FID used for specific modalities like MNIST

ELBO: Evidence Lower Bound—the objective function maximized in Variational Autoencoders

SDE: Stochastic Differential Equation—a mathematical model describing the evolution of the diffusion process over continuous time

classifier-free guidance: A technique in diffusion models to control generation using a conditioning signal without a separate classifier

latent collapse: A failure mode in VAEs where the latent variable carries no information about the input, so the decoder effectively ignores it

Euler-Maruyama integrator: A numerical method used to solve Stochastic Differential Equations (simulate the diffusion process)

CLIP-Score: A metric measuring the semantic similarity between images and text captions
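
For reference, the Fréchet distance underlying FID compares two Gaussians fitted to feature embeddings of real (r) and generated (g) samples. The formula below is the standard definition; that FAD and FMD reuse the same form with a different feature extractor is an assumption based on their glossary entries.

```latex
\mathrm{FD}(r, g) \;=\; \lVert \mu_r - \mu_g \rVert_2^2
\;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2} \right)
```

Here μ and Σ denote the mean and covariance of the embedded features; the distance is zero when the two fitted Gaussians coincide, which is why lower values indicate better sample quality.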