RLVR: Reinforcement Learning with Verifiable Rewards—using binary correct/incorrect feedback based on the final answer to train reasoning models
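A minimal sketch of a verifiable reward in the RLVR sense; exact-match string comparison is an illustrative simplification (real verifiers may normalize or symbolically compare answers):

```python
def verifiable_reward(predicted_answer: str, reference_answer: str) -> float:
    # Binary RLVR reward: 1 if the final answer matches the reference, else 0.
    # No partial credit is given for intermediate reasoning steps.
    return 1.0 if predicted_answer == reference_answer else 0.0
```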
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same prompt, removing the need for a separate value network
Dr.GRPO: A variant of GRPO that modifies the advantage function by removing the standard deviation term in the denominator for better stability
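The two advantage computations can be sketched side by side; function names and the `eps` stabilizer are illustrative, not from a specific implementation:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    # GRPO: center each reward on the group mean and divide by the
    # group standard deviation (eps avoids division by zero when all
    # rewards in the group are identical).
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    # Dr.GRPO: keep the mean-centering but drop the std denominator.
    mu = mean(rewards)
    return [r - mu for r in rewards]

# With binary RLVR rewards for a group of 4 sampled outputs:
rewards = [1.0, 0.0, 0.0, 1.0]
```

Because the group mean replaces a learned baseline, no separate value network is needed.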
Entropy Collapse: A phenomenon where a policy becomes deterministic too early in training, stopping exploration and producing identical outputs
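Collapse is typically monitored via the entropy of the next-token distribution; a small sketch of that diagnostic (in nats):

```python
import math

def token_entropy(probs):
    # Shannon entropy of a single next-token distribution (in nats).
    # A near-deterministic (peaked) distribution has near-zero entropy,
    # which is the signature of entropy collapse.
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = token_entropy([0.25, 0.25, 0.25, 0.25])   # maximal: log(4)
peaked = token_entropy([0.999, 0.001, 0.0, 0.0])    # close to 0
```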
Maj@k: Majority vote accuracy at k—generating k solutions and selecting the most frequent answer as the final prediction
Pass@k: The probability that at least one of k generated solutions is correct
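Both metrics can be computed from n samples per prompt. The Pass@k formula below is the standard unbiased estimator (1 - C(n-c, k)/C(n, k), popularized by the Codex/HumanEval evaluation); the Maj@k helper and its tie-breaking are illustrative:

```python
from collections import Counter
from math import comb

def maj_at_k(answers, correct):
    # Maj@k: is the most frequent of the k sampled answers correct?
    # Ties are broken by first occurrence, an illustrative choice.
    return Counter(answers).most_common(1)[0][0] == correct

def pass_at_k(n, c, k):
    # Unbiased Pass@k estimator from n samples of which c are correct.
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note the contrast: Maj@k rewards the answer the policy concentrates on, while Pass@k rewards diversity, so entropy collapse tends to hurt Pass@k first.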
Policy Nucleus: The smallest set of highest-probability tokens whose cumulative probability reaches the top-p threshold, representing the model's semantically meaningful options
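A minimal sketch of selecting the nucleus from a next-token distribution; the dict-based interface is illustrative:

```python
def nucleus(token_probs, p=0.9):
    # Greedily take tokens in descending probability order until the
    # cumulative mass reaches p; the result is the smallest such set.
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for tok, prob in ranked:
        chosen.append(tok)
        total += prob
        if total >= p:
            break
    return chosen
```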
Self-anchored Regularization: A regularization term that penalizes deviations from the model's initial aggregated entropy rather than maximizing entropy indiscriminately
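A sketch of the anchoring idea; the squared-deviation form and the coefficient are assumptions for illustration, not the definition's exact functional form:

```python
def self_anchor_penalty(current_entropy, anchor_entropy, coeff=0.01):
    # Penalize deviation of the policy's aggregated entropy from its
    # initial (anchor) value, in either direction, rather than pushing
    # entropy up indiscriminately as a plain entropy bonus would.
    return coeff * (current_entropy - anchor_entropy) ** 2
```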