GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of rollouts for the same input, removing the need for a value function critic
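A minimal sketch of the group-relative advantage computation this definition describes: each rollout's reward is normalized against the mean and standard deviation of its own group. The function name and the `eps` stability constant are illustrative assumptions, not the paper's implementation.

```python
def group_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean/std.

    This replaces a learned value-function baseline: the group mean acts
    as the baseline, and dividing by the std rescales the advantages.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed-outcome group: correct rollouts get positive advantage,
# incorrect ones get negative advantage.
print(group_advantages([1, 0, 1, 0]))  # ~[1.0, -1.0, 1.0, -1.0]
```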
Pass@k: A metric measuring the probability that at least one correct answer is generated in k independent attempts
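The standard unbiased estimator for this metric, given `n` sampled attempts of which `c` are correct: pass@k = 1 - C(n-c, k) / C(n, k). The function below is a self-contained sketch of that estimator.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k attempts,
    drawn without replacement from n total attempts with c correct,
    is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect attempts: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 0.3 — equals the raw accuracy when k=1
```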
RLVR: Reinforcement Learning with Verifiable Rewards—RL setting where correctness can be automatically checked (e.g., math problems, code)
Gradient Diminishing: A failure mode in GRPO where all rollouts for a question receive identical rewards (all 0 or all 1), so every advantage is zero and the question contributes no gradient update
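A quick numerical illustration of this failure mode, computing the group-normalized advantage inline (the `1e-8` stability term is an assumption): when every reward in the group is identical, each numerator is exactly zero, so the policy-gradient signal vanishes.

```python
rewards = [1.0, 1.0, 1.0, 1.0]  # every rollout solved the problem
mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
advantages = [(r - mean) / (std + 1e-8) for r in rewards]
print(advantages)  # [0.0, 0.0, 0.0, 0.0] — no gradient from this group
```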
Diversity Collapse: The tendency of RL fine-tuning to narrow the model's distribution onto a single successful solution pattern, reducing exploration
Transform Augmentation: Generating semantically equivalent versions of a question (e.g., via paraphrasing) to use as training data
Pooled Advantage: Computing the normalization statistics (mean and standard deviation) over the rollouts of a whole group of related questions (the original plus its transforms), rather than over a single question's rollouts alone
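A hedged sketch of pooling, with an assumed function name and `eps` constant: statistics are computed over all rollouts of the related questions together. Even if one question's rollouts all succeed (which under per-question normalization would yield zero advantages), variation across its transforms keeps the pooled standard deviation nonzero, so the gradient signal survives.

```python
def pooled_advantages(groups, eps=1e-8):
    """Normalize rewards with mean/std pooled across all related groups
    (original question + transforms), not per-group statistics."""
    flat = [r for g in groups for r in g]
    mean = sum(flat) / len(flat)
    std = (sum((r - mean) ** 2 for r in flat) / len(flat)) ** 0.5
    return [[(r - mean) / (std + eps) for r in g] for g in groups]

# The original question is solved every time, but a harder paraphrase
# is not — pooling yields nonzero advantages for both groups.
print(pooled_advantages([[1, 1, 1], [0, 0, 1]]))
```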
KL divergence: Kullback-Leibler divergence—a statistical distance measure used here to bound the generalization gap between training and test distributions