dLLM: Discrete Diffusion Large Language Model—a generative model that creates text by iteratively denoising a random sequence rather than predicting the next token.
MDLM: Masked Diffusion Language Model—a specific type of dLLM that learns to reconstruct randomly masked tokens.
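The masking step an MDLM trains on can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: each token is independently replaced by a mask symbol with some probability, and the model's reconstruction targets are exactly the masked positions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Independently replace each token with [MASK] with probability
    mask_prob; the model is trained to reconstruct the masked tokens."""
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(tok)   # reconstruction target at this position
        else:
            corrupted.append(tok)
            targets.append(None)  # unmasked positions carry no loss
    return corrupted, targets

rng = random.Random(0)
corrupted, targets = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], 0.5, rng)
```

In a real MDLM the mask probability is tied to the diffusion timestep, so training covers corruption levels from almost clean to fully masked.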
BDLM: Block Diffusion Language Model—generates text in contiguous blocks; within a block, tokens are denoised in parallel via diffusion, while the blocks themselves are generated sequentially, left to right.
AR: Auto-regressive—models that generate text one token at a time, strictly left-to-right (e.g., GPT-4).
SFT: Supervised Fine-Tuning—training on instruction-response pairs to teach the model to follow commands.
DPO: Direct Preference Optimization—an alignment method that optimizes the model to prefer human-chosen responses over rejected ones without a separate reward model.
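The DPO objective for a single preference pair can be written out directly. This sketch assumes the standard DPO loss (negative log-sigmoid of the scaled log-ratio margin); the inputs are summed token log-probabilities under the policy and a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid of the
    beta-scaled difference of policy-vs-reference log-ratios."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # small loss when the policy prefers the chosen response more
    # strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; raising the chosen response's likelihood relative to the reference drives the loss down.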
WSD: Warmup-Stable-Decay—the proposed three-phase training schedule: gradually increasing block size, training on full sequences, then decreasing block size.
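The three-phase shape of the schedule can be sketched as a function of training step. All names and the linear ramps here are hypothetical illustration, not the paper's exact schedule; `max_block` stands in for the full sequence length used during the stable phase.

```python
def block_size_schedule(step, total_steps, min_block=4, max_block=256,
                        warmup_frac=0.1, decay_frac=0.1):
    """Hypothetical WSD-style block-size schedule: linearly grow the
    block size during warmup, hold it at max_block (full sequences)
    during the stable phase, then linearly shrink it during decay."""
    warmup_end = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_end:            # warmup: small blocks -> full sequences
        frac = step / max(warmup_end, 1)
        return int(min_block + frac * (max_block - min_block))
    if step >= decay_start:          # decay: full sequences -> small blocks
        frac = (total_steps - step) / max(total_steps - decay_start, 1)
        return int(min_block + frac * (max_block - min_block))
    return max_block                 # stable: train on full sequences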
KV-cache: Key-Value cache—storing attention computations for past tokens to speed up sequential generation.
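A toy single-head version makes the caching idea concrete: keys and values for past positions are computed once and stored, so each new step attends over the cache instead of recomputing everything. Pure-Python lists stand in for tensors here.

```python
import math

class KVCache:
    """Toy single-head KV-cache: past keys/values are stored once and
    reused, so each new query attends only over the growing cache."""
    def __init__(self):
        self.keys, self.values = [], []

    def attend(self, query, key, value):
        # cache this step's key/value, then attend over all cached steps
        self.keys.append(key)
        self.values.append(value)
        dim = len(query)
        scores = [sum(q * k for q, k in zip(query, ks)) / math.sqrt(dim)
                  for ks in self.keys]
        m = max(scores)                          # numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        return [sum(w * vs[i] for w, vs in zip(weights, self.values))
                for i in range(len(value))]

cache = KVCache()
out1 = cache.attend([1.0, 0.0], [1.0, 0.0], [1.0, 2.0])  # one entry cached
out2 = cache.attend([1.0, 0.0], [0.0, 1.0], [5.0, 6.0])  # attends over two
```

With only one cached entry the attention weight is 1, so the first output equals the first value vector; note that classic KV-caching fits sequential (AR or block-sequential) decoding, which is why it is nontrivial for fully parallel diffusion generation.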
Top-k checkpoint merging: Averaging the weights of the k best-performing model checkpoints to improve stability and generalization.
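The merge itself is a plain parameter-wise mean. A minimal sketch, with flat lists of floats standing in for parameter tensors and checkpoints represented as name-to-parameter dicts:

```python
def merge_checkpoints(checkpoints):
    """Parameter-wise average over k checkpoints; all checkpoints must
    share identical parameter names and shapes."""
    k = len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [sum(ckpt[name][i] for ckpt in checkpoints) / k
                        for i in range(size)]
    return merged

# three hypothetical "best" checkpoints of a one-parameter model
ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [5.0, 6.0]}]
merged = merge_checkpoints(ckpts)
```

Averaging only works because the checkpoints come from the same training run and thus live in the same loss basin; merging unrelated models this way generally fails.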