RLVR: Reinforcement Learning with Verifiable Rewards—a post-training method that uses ground-truth verifiers (e.g., code execution, math answer checking) to reward and guide LLMs.
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same prompt, removing the need for a separate value network.
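The group-relative advantage at the heart of GRPO can be sketched in a few lines: each sampled output's reward is normalized against the mean and standard deviation of its own group, so no value network is required. A minimal illustration (function name is hypothetical; real implementations operate on batched tensors):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward against the
    group's mean and std, replacing a learned value-network baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against an all-equal group
    return [(r - mean) / std for r in rewards]

# Four sampled completions of one prompt, scored by a binary verifier:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

Correct answers get positive advantages and incorrect ones negative, purely by comparison within the group.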
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—an enhancement of GRPO using techniques such as decoupled (asymmetric) clipping, dynamic sampling, and a token-level loss.
Entropy: A measure of uncertainty in the model's next-token prediction; high entropy suggests branching/reasoning points, low entropy suggests factual/syntactic completion.
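To make the high/low distinction concrete, here is Shannon entropy over a next-token distribution (a sketch; the example probabilities are invented):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Near-deterministic continuation (e.g., closing a bracket): low entropy.
low = token_entropy([0.98, 0.01, 0.01])
# Several equally plausible continuations (a branching point): high entropy.
high = token_entropy([0.25, 0.25, 0.25, 0.25])  # = ln(4) ≈ 1.386
```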
KL divergence: A penalty term used in RL to prevent the trained model from drifting too far from the reference model (usually the SFT model).
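In practice the penalty is estimated per token from log-probabilities. One common nonnegative estimator (the "k3" form popularized by Schulman and used in GRPO-style training; the function name here is hypothetical):

```python
import math

def kl_per_token(logp_policy, logp_ref):
    """Per-token KL estimate exp(r) - r - 1 with r = logp_ref - logp_policy.
    Averaged over sampled tokens, this approximates KL(pi || pi_ref);
    it is zero when the policies agree and grows as they drift apart."""
    r = logp_ref - logp_policy
    return math.exp(r) - r - 1.0

print(kl_per_token(-2.0, -2.0))  # -> 0.0 (no drift, no penalty)
print(kl_per_token(-1.0, -2.5))  # positive: the policy has drifted
```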
Pass@1: The percentage of problems where the model generates a correct solution on its first attempt.
Pass@K: The probability that at least one of K generated solutions is correct.
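Pass@K is usually computed with the unbiased estimator of Chen et al. (2021): generate n ≥ k samples, count the c correct ones, and estimate the chance that a random size-k subset contains at least one correct sample.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), the probability
    that at least one of k samples drawn from n generations (c correct)
    solves the problem."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a size-k subset
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 generations, 3 correct: pass@1 = 0.3, and pass@5 is much higher.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

Note that with k = 1 this reduces to the empirical accuracy c/n, matching the Pass@1 definition above.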
Gradient masking: A technique where gradients for certain tokens are zeroed out to prevent them from being updated during training.
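The mechanics are simple: multiply per-token gradients (or per-token losses) by a 0/1 mask so masked positions contribute nothing to the update. A minimal sketch with invented numbers (frameworks do this on tensors, e.g. via a loss mask before reduction):

```python
def apply_gradient_mask(grads, mask):
    """Zero out gradients at masked positions (mask entry 0), e.g. for
    prompt tokens or truncated samples, so they receive no update."""
    return [g * m for g, m in zip(grads, mask)]

# Mask out the second token's gradient:
print(apply_gradient_mask([0.4, 1.1, 0.7], [1, 0, 1]))  # -> [0.4, 0.0, 0.7]
```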
SFT: Supervised Fine-Tuning—the initial training phase on labeled data before RL.
Clipping: Restricting the ratio of the new policy probability to the old policy probability to prevent destructively large updates.
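For a single token, the PPO-style clipped surrogate combines the ratio and advantage like this (a sketch with hypothetical names; the asymmetric bounds are how DAPO's decoupled "clip-higher" fits in, with eps_high > eps_low):

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped surrogate: min(r*A, clip(r, 1-eps_low, 1+eps_high)*A).
    Clipping the ratio caps how much a single update can move the policy."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)

# A ratio of 2.0 with positive advantage is capped at (1 + eps_high) * A:
print(clipped_objective(2.0, 1.0))  # -> 1.2
# A ratio inside the clip range passes through unchanged:
print(clipped_objective(1.1, 1.0))
```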