DiT: Diffusion Transformer—a neural network architecture for generating images/videos that uses Transformer blocks instead of the traditional U-Net
MMDiT: Multimodal Diffusion Transformer, a DiT variant in which text and visual tokens are processed with separate weights but interact through joint self-attention
RoPE: Rotary Positional Embedding, a method that encodes position by rotating query and key vectors by position-dependent angles, so that attention scores depend on the relative positions of tokens
LCT: Long Context Tuning—the authors' proposed method to extend single-shot models to multi-shot contexts
Rectified Flow: A training formulation for diffusion models that learns a straight path between noise and data, often improving generation speed and quality
KV-cache: Key-Value cache, a store of the attention keys and values computed for earlier tokens, so that auto-regressive generation can reuse them instead of recomputing them at every step
Auto-regressive: Generating a sequence piece-by-piece, where each new piece depends on what was generated before
Shot: A continuous footage sequence filmed by a single camera without interruption
Scene: A series of shots capturing coherent events unfolding over time (e.g., a conversation)
Logit-normal distribution: The distribution of a variable in (0, 1) whose logit is normally distributed; used here to sample diffusion timesteps, so that noise levels vary across training examples
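
To make the Rectified Flow and logit-normal entries concrete, the sketch below draws timesteps from a logit-normal distribution and builds one noised training example along a straight line between data and noise. It is a minimal illustration, not the authors' implementation: the function names, the `mean` and `std` parameters, and the convention that t = 0 is clean data and t = 1 is pure noise are all assumptions made for this example.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through the sigmoid.

    mean and std are hypothetical defaults, not values from the paper.
    """
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def rectified_flow_pair(x0, noise, t):
    """Return the noised sample x_t and the velocity target for one example.

    Assumed convention: t = 0 is clean data, t = 1 is pure noise.
    """
    # Straight-line interpolation between the data point and the noise.
    x_t = [(1.0 - t) * d + t * n for d, n in zip(x0, noise)]
    # Along a straight path the velocity is constant: noise minus data.
    velocity = [n - d for d, n in zip(x0, noise)]
    return x_t, velocity


rng = random.Random(0)

# One timestep per training example, so noise levels differ across examples.
timesteps = [sample_logit_normal(rng) for _ in range(1000)]

# A toy three-element "latent" standing in for an image or video latent.
x0 = [0.5, -1.0, 2.0]
noise = [rng.gauss(0.0, 1.0) for _ in x0]
x_t, velocity = rectified_flow_pair(x0, noise, timesteps[0])
```

With `mean=0.0` the sampled timesteps concentrate around 0.5 and thin out toward 0 and 1, so intermediate noise levels are seen most often while every draw still stays strictly inside (0, 1).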