CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before producing the final output
V2A: Video-to-Audio—the task of generating soundtracks that correspond to silent video inputs
Flow Matching: A generative modeling technique that learns a velocity field to transform noise into data, offering an alternative to diffusion models
MLLM: Multimodal Large Language Model—an LLM capable of processing and reasoning about non-text inputs like images and audio
Foley: The reproduction of everyday sound effects that are added to film, video, and other media in post-production
VAE: Variational Autoencoder—a neural network used here to compress audio into latent representations for efficient generation
ROI: Region of Interest—a specific area within a video frame selected (e.g., by user click) for targeted processing
CFG: Classifier-Free Guidance—a technique in generative models to control the strength of conditioning signals (like text or video) during sampling
AdaLN: Adaptive Layer Normalization—a mechanism to inject conditioning information (like time embeddings or global context) into network layers
DiT: Diffusion Transformer—a transformer-based architecture used for diffusion (or flow matching) models, replacing the traditional U-Net
FAD: Fréchet Audio Distance—a metric for evaluating audio quality by comparing the embedding statistics of generated audio against those of real audio
CLAP: Contrastive Language-Audio Pretraining—a model used to compute similarity scores between audio and text for evaluation
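
To make the Flow Matching and CFG entries above more concrete, here is a minimal sketch of how the two combine at sampling time. It is an illustration under stated assumptions, not the implementation of any particular system: `toy_velocity`, `guided_velocity`, the targets, and the guidance scale are all hypothetical stand-ins for a trained velocity network and its conditioning.

```python
import numpy as np


def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance applied to velocity predictions.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates past the conditional prediction.
    """
    return v_uncond + scale * (v_cond - v_uncond)


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


def toy_velocity(target):
    """Hypothetical stand-in for a trained network: points at a fixed target."""
    target = np.asarray(target, dtype=float)
    return lambda x, t: (target - x) / (1.0 - t)


def guided_velocity(x, t, scale=2.0):
    """Assumed toy setup: two branches that differ only in their target."""
    v_cond = toy_velocity([1.0, 1.0])(x, t)    # "conditioned" branch
    v_uncond = toy_velocity([0.0, 0.0])(x, t)  # "unconditioned" branch
    return cfg_velocity(v_cond, v_uncond, scale)


# Start from an arbitrary "noise" point and integrate the guided field.
sample = euler_sample(guided_velocity, x0=np.array([0.3, -0.7]))
```

In this toy setup, a guidance scale of 2 carries the sample past the conditional target `[1, 1]` to approximately `[2, 2]`, which is the extrapolation effect that a CFG scale above 1 is meant to produce. A real model would replace the two toy branches with one network evaluated with and without its conditioning input.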