dLLM: Discrete Diffusion Large Language Model—a text generation model that generates tokens via a denoising process rather than autoregressively
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines using group averages of rewards
ELBO: Evidence Lower Bound—a tractable proxy used to estimate the intractable log-likelihood of diffusion models
Importance Ratio: The ratio of the target policy probability to the behavior policy probability, used to reweight samples in RL updates
Unconditional Clipping: A mechanism in StableDRL that limits importance ratios to a trust region regardless of whether the update improves or worsens the objective
Self-Normalization: Normalizing the gradient update by the sum of importance weights rather than by the number of samples, so that the weights sum to one and the update remains a convex combination of the per-sample gradients
Staircase Attention: A masking pattern for block diffusion models that allows a block to see clean history while masking its own targets, enabling efficient likelihood estimation
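
Several of the terms above (GRPO's group baseline, importance ratios, unconditional clipping, and self-normalization) can be illustrated with a few lines of arithmetic. The following is a minimal sketch, not StableDRL's actual implementation: the function names, the clipping width eps=0.2, and the division by the group standard deviation (a common GRPO convention) are all assumptions made for illustration. For a dLLM, the log-probabilities passed in would typically be ELBO-based estimates rather than exact values.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style baseline: center each reward on the group mean.

    Dividing by the group standard deviation is an assumed convention here.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def importance_ratios(target_logps, behavior_logps):
    """Ratio of target to behavior policy probability, computed from log-probs."""
    return [math.exp(t - b) for t, b in zip(target_logps, behavior_logps)]

def clip_unconditionally(ratios, eps=0.2):
    """Clamp every ratio into [1 - eps, 1 + eps], whatever its effect on the objective.

    eps is a hypothetical trust-region width chosen for this sketch.
    """
    return [min(max(r, 1.0 - eps), 1.0 + eps) for r in ratios]

def self_normalized_weights(ratios):
    """Divide by the sum of the weights instead of the sample count.

    The resulting weights are positive and sum to one.
    """
    total = sum(ratios)
    return [r / total for r in ratios]

# Usage: four sampled responses to the same prompt (made-up numbers).
rewards = [1.0, 0.0, 0.0, 1.0]
target_logps = [-1.0, -0.5, -2.0, -1.2]
behavior_logps = [-1.0, -1.5, -1.0, -1.1]

advantages = group_relative_advantages(rewards)
ratios = clip_unconditionally(importance_ratios(target_logps, behavior_logps))
weights = self_normalized_weights(ratios)
# Per-sample coefficients that would scale each sample's gradient.
coefficients = [w * a for w, a in zip(weights, advantages)]
```

Because the clipped ratios are positive and the normalized weights sum to one, the weighted combination cannot be inflated by a single large ratio, which is the stabilizing property the Self-Normalization entry describes.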