GRPO: Group-Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the mean reward of a group of samples from the same prompt, eliminating the need for a critic model
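The group-relative advantage described above can be sketched in a few lines. The glossary does not spell out the exact normalization, so this sketch assumes the common variant that also divides by the group's reward standard deviation; the function name `grpo_advantages` is illustrative, not from the source.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages (sketch): each sample's reward minus the
    mean reward of its group, normalized by the group's std deviation.
    No learned critic is needed -- the group itself supplies the baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# One prompt, four sampled completions scored by a verifier:
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct samples get positive advantage, incorrect ones negative.
```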
RLVR: Reinforcement Learning from Verifiable Rewards—training LLMs on tasks where correctness can be automatically checked (e.g., math, code) rather than relying on human preference models
Ordinal Rewards: Rewards on an ordered, graded scale (e.g., 0.0 to 1.0) rather than binary pass/fail, enabling partial credit for partially correct solutions
CoRPO: Correctness-Relative Policy Optimization—the proposed method which clips the GRPO baseline at a minimum correctness threshold to prevent reinforcing failures
Distribution Sharpening: The tendency of a policy to concentrate probability mass on a narrow set of high-reward solutions, reducing exploration and diversity
Baseline Clipping: The mechanism of enforcing a minimum value for the baseline (in this case, the correctness threshold) to ensure advantages for incorrect samples are never positive
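The clipping mechanism behind CoRPO can be illustrated as a one-line change to the group baseline. This is a sketch under the definitions above, not the paper's implementation; the threshold parameter `tau` and the unnormalized advantages are assumptions for illustration.

```python
import statistics

def corpo_advantages(rewards, tau=0.5):
    """CoRPO-style baseline clipping (sketch): floor the group-mean
    baseline at the correctness threshold tau, so a sample whose reward
    falls below tau (i.e., an incorrect sample) can never receive a
    positive advantage, even if it beats the group mean."""
    baseline = max(statistics.mean(rewards), tau)
    return [r - baseline for r in rewards]

# All four samples are incorrect but earn graded partial credit.
# A plain group-mean baseline (0.15) would reward the 0.3 sample;
# flooring the baseline at tau keeps every failure's advantage negative.
advs = corpo_advantages([0.3, 0.1, 0.1, 0.1], tau=0.5)
```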
OOD: Out-of-Distribution—evaluating the model on tasks or domains not seen during training to test generalization
Pass@k: A metric measuring the probability that at least one of k generated solutions is correct
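Pass@k is usually computed with the unbiased estimator of Chen et al. (2021) rather than by literally drawing k samples; a sketch, assuming n total samples of which c are correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (sketch): the probability that at least
    one of k samples drawn without replacement from n generations,
    c of which are correct, is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 generations, 3 correct: pass@1 is simply the success rate, 0.3.
p1 = pass_at_k(10, 3, 1)
```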