dLLM: Diffusion Language Model—generates text by iteratively denoising a full sequence rather than predicting tokens one by one
AR: Autoregressive—models that generate text sequentially from left to right (Next-Token Prediction)
recency bias: The tendency of a model's representations to change substantially with every new token generated; common in AR models
FLOPs: Floating Point Operations—a measure of computational cost
initialization bias: The phenomenon where a model retains the representational properties of its pre-trained starting point (e.g., AR) even after fine-tuning with a different objective (e.g., diffusion)
KV-cache: Key-Value Cache—stores the attention keys and values of previously processed tokens so they are not recomputed at every generation step, speeding up autoregressive decoding
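A minimal sketch of the idea (hypothetical names; single-head attention in plain NumPy, not any particular library's implementation): at each decoding step only the new token's key and value are computed, while past entries are reused from the growing cache.

```python
import numpy as np

def attend(q, K, V):
    """Attention for a single query vector over cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())   # softmax over cached positions
    w /= w.sum()
    return w @ V

d = 8
rng = np.random.default_rng(0)
K_cache, V_cache = [], []               # grows by one entry per generated token
for step in range(4):
    k, v, q = rng.standard_normal((3, d))
    K_cache.append(k)                   # only the new token's k/v are computed;
    V_cache.append(v)                   # earlier ones come from the cache
    out = attend(q, np.stack(K_cache), np.stack(V_cache))

print(out.shape)  # (8,)
```

Without the cache, every step would recompute keys and values for the entire prefix, which is the quadratic cost the technique avoids.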
coarse-to-fine: A representational hierarchy where early layers process broad, global features and later layers refine specific details
cosine similarity: A metric of the angle between two vectors, used here to measure how much the hidden-state representation changes between two consecutive layers; a value near 1 means the layer altered the representation very little
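As a sketch of how this measurement works (hypothetical hidden states standing in for real model activations), the per-layer change is the cosine similarity between a token's hidden state before and after a layer:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical hidden states for one token at layers l and l+1;
# a small residual update models a layer that refines rather than rewrites.
rng = np.random.default_rng(0)
h_l = rng.standard_normal(768)
h_next = h_l + 0.1 * rng.standard_normal(768)

print(cosine_similarity(h_l, h_next))  # near 1.0: the layer barely changed the representation
```

In the coarse-to-fine picture above, later layers would show similarities closer to 1 than early layers, since they make smaller refinements.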