MDLM: Masked Diffusion Language Model—generates text by iteratively refining a sequence in which tokens are randomly masked, allowing the model to use bidirectional context
BDLM: Block Diffusion Language Model—a hybrid approach where tokens are generated in blocks; diffusion is applied within blocks while blocks are generated auto-regressively
WSD: Warmup-Stable-Decay—the proposed three-phase training schedule that converts auto-regressive (AR) models into diffusion models by manipulating the block size
SFT: Supervised Fine-Tuning—training on instruction-response pairs to teach the model to follow user commands
DPO: Direct Preference Optimization—an alignment algorithm that optimizes the model to prefer higher-quality responses over lower-quality ones without a separate reward model
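The DPO objective can be sketched as a logistic loss on the margin between the policy's and a frozen reference model's log-probabilities for the chosen versus rejected response. The scalar log-probability inputs below are a simplification (real implementations sum token log-probs over each response); the function name and signature are illustrative, not from the source.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where margin is
    the policy-vs-reference log-ratio gap between chosen and rejected.
    Inputs are summed log-probs of the full responses (scalars here)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy matches the reference, the margin is zero and the loss is log 2; raising the chosen response's log-probability relative to the reference lowers the loss, which is the preference signal DPO optimizes without a separate reward model.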
KV-cache: Key-Value cache—storing the attention keys and values of previous tokens to speed up later generation steps; caching is typically infeasible in diffusion models, which update all positions in parallel, but is enabled here by Block Diffusion's auto-regressive structure across blocks
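A toy single-head cache makes the mechanism concrete: keys and values for already-generated positions are stored once, and each new query attends over them without recomputation. The class and its methods are a minimal illustration, not the paper's implementation.

```python
import numpy as np


class KVCache:
    """Toy single-head KV cache: keys/values of past tokens are stored so a
    new query attends over all cached positions without recomputing them."""

    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def append(self, k, v):
        """Cache the key/value vectors of one newly generated token."""
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

    def attend(self, q):
        """Scaled dot-product attention of query q over all cached tokens."""
        scores = self.keys @ q / np.sqrt(q.shape[0])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.values
```

In Block Diffusion, completed blocks play the role of the cached tokens here: their keys and values are fixed once the block is generated, while diffusion steps only recompute the current block.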
Top-k checkpoint merging: Averaging the parameters of the k best-performing checkpoints to improve generalization and stability
document-level attention mask: A masking technique that restricts attention to within individual documents when multiple short documents are packed into one training sequence
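The document-level mask is just a block-diagonal boolean matrix derived from per-token document IDs: a position may attend to another only if both belong to the same packed document. A minimal sketch (the function name and ID-based interface are assumptions):

```python
import numpy as np


def document_attention_mask(doc_ids):
    """Boolean (seq_len, seq_len) mask for a packed sequence: entry (i, j)
    is True iff tokens i and j belong to the same document."""
    ids = np.asarray(doc_ids)
    # Broadcast row IDs against column IDs; equality yields block-diagonal True regions.
    return ids[:, None] == ids[None, :]
```

For causal training this mask would additionally be intersected with a lower-triangular mask; the document constraint alone is what prevents attention from leaking across the boundaries of packed short documents.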