RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using outcomes (correct/incorrect) as the primary reward signal
Pass@k: A metric evaluating the probability that at least one correct solution is generated out of k independent samples
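Pass@k is commonly computed with the standard unbiased estimator (1 − C(n−c, k) / C(n, k)): draw k samples without replacement from n generations, of which c are correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples,
    drawn without replacement from n generations with c correct,
    is correct."""
    if n - c < k:
        # fewer than k incorrect generations exist, so any k-subset
        # must contain at least one correct solution
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 generations, 3 correct: Pass@1 reduces to the raw accuracy c/n
print(pass_at_k(10, 3, 1))  # → 0.3
print(pass_at_k(10, 3, 2))
```

Note that k = 1 recovers Pass@1 exactly: 1 − (n−c)/n = c/n.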
Pass@1: The accuracy of the model when generating a single solution; equivalent to Pass@k with k = 1
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input to reduce variance without a value network
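The group-relative normalization at the core of GRPO can be sketched as follows: for a group of rewards sampled for the same prompt, each advantage is the reward standardized against the group mean and standard deviation (the epsilon term here is an assumed numerical-stability constant):

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: A_i = (r_i - mean(r)) / (std(r) + eps).
    Normalizing within the group reduces variance without a value network."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# a group of 4 sampled solutions with binary verifiable rewards:
# correct solutions get positive advantage, incorrect ones negative
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```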
Policy Entropy: A measure of the randomness/diversity in the model's token predictions; low entropy indicates confident, low-diversity predictions, a symptom of collapse
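Policy entropy is the Shannon entropy of the next-token distribution; a minimal sketch over an explicit probability vector:

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution.
    0 for a deterministic (collapsed) distribution; log(V) for a
    uniform distribution over V tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(token_entropy([1.0, 0.0, 0.0]))  # collapsed → 0.0
print(token_entropy([0.25] * 4))       # uniform → log(4) ≈ 1.386
```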
Mode Collapse: A failure mode where the model converges to a limited set of outputs, losing diversity and exploration capability
Variational Problems: Synthetically generated problems that differ in wording or structure from an original problem but preserve the underlying logic and final answer
Self-play: A training paradigm where the model generates its own training data (problems and solutions) and learns from it
Reward Shaping: Modifying the raw reward signal (e.g., correct/incorrect) to guide the learning process more effectively (e.g., penalizing trivial synthetic problems)
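A hypothetical shaping rule matching the parenthetical example above: start from the raw verifiable reward and subtract a penalty when the synthetic problem is trivial. The `solver_pass_rate` input and the 0.9 triviality threshold are illustrative assumptions, not a prescribed scheme:

```python
def shaped_reward(correct: bool, solver_pass_rate: float) -> float:
    """Hypothetical reward shaping: base verifiable reward (1/0 for
    correct/incorrect) minus a penalty for trivial synthetic problems,
    i.e. those the solver already passes almost always.
    solver_pass_rate and the 0.9 threshold are assumed for illustration."""
    base = 1.0 if correct else 0.0
    penalty = 1.0 if solver_pass_rate > 0.9 else 0.0
    return base - penalty

print(shaped_reward(True, 0.5))   # non-trivial, correct → 1.0
print(shaped_reward(True, 0.95))  # trivial, correct → 0.0
```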