MLD: Multi-modal Latent Diffusion—the authors' proposed method using deterministic autoencoders and latent diffusion
coherence: The semantic consistency between generated modalities (e.g., if an image shows a '3', the generated audio should say 'three')
MOPOE: Mixture of Products of Experts, a VAE-based baseline combining mixture and product aggregations
FID: Fréchet Inception Distance—a metric for assessing the quality of generated images by comparing feature distributions
FAD: Fréchet Audio Distance—similar to FID but for evaluating audio quality
FMD: Fréchet Modality Distance, a generalization of FID applied to individual modalities (e.g., MNIST digits)
ELBO: Evidence Lower Bound—the objective function maximized in Variational Autoencoders
SDE: Stochastic Differential Equation—a mathematical model describing the evolution of the diffusion process over continuous time
classifier-free guidance: A technique in diffusion models to control generation using a conditioning signal without a separate classifier
latent collapse: A failure mode in VAEs where the latent variable carries no information about the input, so the decoder effectively ignores it
Euler-Maruyama integrator: A numerical method for approximating solutions of Stochastic Differential Equations, used to simulate the diffusion process
CLIP-Score: A metric measuring the semantic similarity between images and text captions
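
To make the SDE and Euler-Maruyama entries concrete, here is a minimal sketch of the integrator on a scalar SDE of the form dx = f(x, t) dt + g(t) dW. It is a generic illustration, not the authors' sampler: the function names, the constant `BETA`, and the variance-preserving-style drift and diffusion are assumptions chosen only for the example.

```python
import math
import random


def euler_maruyama(x0, drift, diffusion, t0, t1, n_steps, rng):
    """Approximate one sample path of dx = drift(x, t) dt + diffusion(t) dW.

    Returns the state at time t1, starting from x0 at time t0.
    """
    dt = (t1 - t0) / n_steps
    x, t = x0, t0
    for _ in range(n_steps):
        # The Brownian increment over a step of length dt is N(0, dt).
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = x + drift(x, t) * dt + diffusion(t) * dw
        t += dt
    return x


# Hypothetical forward (noising) SDE with a constant noise rate BETA.
BETA = 1.0


def vp_drift(x, t):
    return -0.5 * BETA * x


def vp_diffusion(t):
    return math.sqrt(BETA)


def zero_diffusion(t):
    return 0.0


rng = random.Random(0)

# With the noise switched off the scheme reduces to the plain Euler method.
deterministic = euler_maruyama(2.0, vp_drift, zero_diffusion, 0.0, 1.0, 200, rng)

# With noise on, the average over many paths should approach x0 * exp(-BETA / 2).
samples = [
    euler_maruyama(2.0, vp_drift, vp_diffusion, 0.0, 1.0, 200, rng)
    for _ in range(2000)
]
sample_mean = sum(samples) / len(samples)
```

For this linear SDE the exact mean at t = 1 is x0 * exp(-BETA / 2), so `deterministic` and `sample_mean` give a quick sanity check on the step size. Sampling from a diffusion model would apply the same update to a reverse-time SDE whose drift includes a learned score term.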