event curve: A 1D temporal signal representing the magnitude of change over time within a single modality, derived from the cosine similarity between consecutive feature vectors (low similarity indicates a large change)
rectified flow: A generative model that learns a transport map (velocity field) to transform a simple prior distribution (noise) into the data distribution via an ordinary differential equation (ODE)
DiT: Diffusion Transformer—a neural network architecture that uses transformers instead of U-Nets for the backbone of diffusion-based generative models
intra-modal similarity: The similarity between data points (e.g., frames or audio segments) within the same modality, used here to detect structure regardless of content
zero-pair: A training setting in which the model never sees paired examples of input (video) and output (music); it learns from independent, unpaired datasets
FAD: Fréchet Audio Distance, a metric for evaluating the quality of generated audio by comparing the statistics (mean and covariance) of its embeddings with those of a reference audio set; lower is better
CLAP: Contrastive Language-Audio Pretraining—a model used to measure semantic similarity between audio and text (or video)
beat alignment: A metric measuring how well musical beats coincide with visual events like dance moves or scene cuts
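
As an illustration of the event curve entry above, the following is a minimal sketch and not the authors' implementation. It assumes each time step (e.g., a video frame or audio segment) is already represented by a feature vector, and it assumes the change magnitude is taken as 1 minus the cosine similarity of consecutive vectors; the glossary only states that cosine similarity is used. The names `cosine_similarity` and `event_curve` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine similarity of two vectors; eps guards against division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def event_curve(features: Sequence[Sequence[float]]) -> List[float]:
    """Return one change value per pair of consecutive feature vectors.

    Assumption: change magnitude = 1 - cosine similarity, so identical
    consecutive vectors give ~0 and orthogonal ones give ~1.
    """
    return [
        1.0 - cosine_similarity(features[t], features[t + 1])
        for t in range(len(features) - 1)
    ]


# Toy features: the content changes abruptly between steps 1 and 2.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
curve = event_curve(frames)
```

Because the curve depends only on similarities within one modality, the same function can in principle be applied to video features and to audio features separately, which is what makes the resulting signals comparable across modalities.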