Flow Matching: A generative modeling framework that learns a time-dependent velocity field which transports samples from a simple prior distribution (like Gaussian noise) to the data distribution; its regression objective is often simpler to train than diffusion.
TAE: Temporal Autoencoder—a neural network that compresses video data spatially and temporally into a compact latent representation for efficient processing.
SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, high-quality dataset to improve instruction following and output quality.
Diegetic Audio: Sound that originates from a source within the video's world (e.g., footsteps, dialogue), as opposed to background music (non-diegetic).
Bi-directional Attention: An attention mechanism where every token can attend to every other token in the sequence, unlike the causal attention used in text generation, where each token attends only to itself and earlier tokens.
Latent Space: A compressed representation of data (images/video) where the generative model operates, reducing computational complexity compared to pixel space.
FSDP: Fully Sharded Data Parallel, a technique that shards model parameters, gradients, and optimizer states across multiple GPUs so that models too large for a single GPU's memory can be trained.
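RoCE RDMA: RDMA (Remote Direct Memory Access) over Converged Ethernet, a network protocol that lets servers read and write each other's memory directly over Ethernet, bypassing the remote CPU and operating system, for high-speed communication between GPU servers during training.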
ODE solver: Ordinary Differential Equation solver, an algorithm used during Flow Matching inference to numerically integrate the learned velocity field, stepping from a noise sample to the final image/video.
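SwiGLU: A gated activation function used in the feed-forward layers of modern Transformers (like Llama) that combines the Swish activation with Gated Linear Units: one linear projection of the input is passed through Swish and multiplied elementwise with a second linear projection.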
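
To make the SwiGLU entry concrete, here is a minimal pure-Python sketch of the gating step. The helper names (swish, linear, swiglu), the weight names (w_gate, w_up), and the toy values are illustrative assumptions, not taken from any particular model; biases are omitted for brevity, and a real Transformer block would follow this step with a further down-projection.

```python
import math

def swish(z):
    """Swish (also called SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def linear(x, w):
    """Multiply a vector x (length n) by a weight matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def swiglu(x, w_gate, w_up):
    """SwiGLU gating: Swish(x @ w_gate) multiplied elementwise with (x @ w_up)."""
    gate = [swish(g) for g in linear(x, w_gate)]  # gate branch, passed through Swish
    up = linear(x, w_up)                          # plain linear branch
    return [g * u for g, u in zip(gate, up)]

# Toy example (hypothetical values): project a 2-dim input to 3 hidden units.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0, 0.5],
          [0.0, 1.0, 0.5]]
w_up = [[1.0, 1.0, 1.0],
        [0.0, 1.0, -1.0]]
hidden = swiglu(x, w_gate, w_up)
print(hidden)
```

The two projections have the same output width so that they can be multiplied elementwise; the gate branch decides how much of each hidden unit from the linear branch passes through.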