Rectified Flow Matching: A generative modeling method that learns a velocity field transporting samples from the noise distribution to the data distribution along near-straight paths; it is often simpler to train and faster to sample from than standard diffusion
RoPE: Rotary Positional Embedding, a method that encodes position by rotating pairs of dimensions in the query and key vectors through position-dependent angles, so that attention scores depend on relative position
ECTF: Efficient Complete Teacher Forcing—a training strategy that masks attention so clean history is used to predict noisy targets, avoiding redundant re-computation of history for every noise level
MMDiT: Multi-Modal Diffusion Transformer, an architecture that uses separate weights for each modality within a transformer block while attending jointly over the tokens of all modalities
CFG: Classifier-Free Guidance, a technique that improves sample quality and adherence to the conditioning by extrapolating from the unconditional model prediction toward, and past, the conditional one
VAE: Variational Autoencoder—used here to compress images into latent space for generation
ViT: Vision Transformer—used here to extract high-level semantic features for understanding
NTP: Next Token Prediction, the standard training objective for autoregressive language models, in which each token is predicted from the tokens that precede it
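
Two of the entries above, CFG and RoPE, reduce to short formulas, so a minimal sketch may make them concrete. It is written in plain Python under common textbook conventions and is not taken from any particular model's implementation; the function names `cfg_combine`, `rope_rotate`, and `dot`, the pairing of adjacent dimensions, and the frequency base of 10000 are all assumptions made for illustration.

```python
import math


def cfg_combine(cond, uncond, scale):
    """Classifier-free guidance (illustrative form).

    Starts at the unconditional prediction and moves `scale` times the
    difference (cond - uncond). scale = 0 returns the unconditional
    prediction, scale = 1 the conditional one, scale > 1 extrapolates past it.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def rope_rotate(vec, position, base=10000.0):
    """Rotary embedding (illustrative form).

    Rotates each adjacent pair (vec[i], vec[i + 1]) by the angle
    position * base ** (-i / d), where d is the vector length and i is even.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.1, 0.4, -0.6, 0.9]
    # Same relative offset (2) at different absolute positions.
    near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
    far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
    print(abs(near - far) < 1e-9)
    print(cfg_combine([1.0, 2.0], [0.0, 0.0], 2.0))
```

Because each pair is rotated by an angle proportional to its position, the dot product of a rotated query and a rotated key depends only on the difference between their positions, which is the property the final comparison checks.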