AAPT: Autoregressive Adversarial Post-Training—the proposed method to convert diffusion models into fast autoregressive generators
Student-forcing: Training technique in which the model conditions each step on its own previously generated output rather than on the ground-truth frame (conditioning on ground truth is teacher-forcing)
KV cache: Key-Value cache—storing attention representations of past tokens to avoid recomputing them at every step, standard in LLMs but applied here to video
NFE: Number of Function Evaluations—the number of times the neural network is run to generate an output (1 NFE means one pass)
Diffusion forcing: A method to train diffusion models for sequential generation by assigning different noise levels to different frames
Block causal attention: Attention mechanism in which tokens attend to every token in their own block (for example, the same frame) and in past blocks, but never to future blocks, preventing information leakage from future frames
Latent frame: A compressed representation of video frames (here, 1 latent frame = 4 video frames) processed by the model
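As a rough illustration of three of the terms above (block causal attention, student-forcing, and NFE), here is a minimal NumPy sketch. It is not the AAPT implementation: the function names, the choice of one frame per block, and `toy_step` (a stand-in for a real forward pass with a KV cache) are all assumptions made for this example.

```python
import numpy as np


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask where entry [q, k] is True if query token q may attend to key token k.

    Assumes one block per frame: a token sees every token in its own frame
    and in earlier frames, and nothing in later frames.
    """
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]


def student_forcing_rollout(step_fn, first_input, num_steps):
    """Autoregressive rollout that feeds each generated output back in as the next input.

    step_fn(x, cache) -> (y, new_cache) is a hypothetical stand-in for one
    forward pass of the generator, so each call counts as 1 NFE.
    """
    cache = []  # stand-in for a KV cache of past steps
    x = first_input
    outputs = []
    nfe = 0
    for _ in range(num_steps):
        y, cache = step_fn(x, cache)
        nfe += 1
        outputs.append(y)
        x = y  # student-forcing: the next input is the model's own output
    return outputs, nfe


def toy_step(x, cache):
    """Toy 'model': remembers past inputs and returns their mean plus one."""
    cache = cache + [x]  # grow the cache instead of recomputing the past
    y = float(np.mean(cache)) + 1.0
    return y, cache


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
outputs, nfe = student_forcing_rollout(toy_step, first_input=0.0, num_steps=3)
```

With 3 frames of 2 tokens each, `mask` is a 6x6 matrix in which tokens of the first frame see only each other, while tokens of the last frame see all six tokens. The rollout makes one call to `step_fn` per generated step, so `nfe` equals `num_steps`; under teacher-forcing, `x` would instead be replaced by the ground-truth value at each step.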