RLVR: Reinforcement Learning with Verifiable Rewards, a paradigm in which a model generates samples that a verifier (a function or a model) scores to guide training
GRPO: Group Relative Policy Optimization, an RL algorithm that normalizes rewards within a sampled group to compute advantages, removing the need for a separate value network
Reward Hacking: When a model exploits flaws in the reward function (e.g., getting the right answer for the wrong reason) to maximize score without true improvement
False Positive: In this context, a response whose final answer is correct but whose reasoning steps are incorrect, which standard verifiers tend to accept as correct
Reachability: The probability that the policy model can generate at least one correct response during exploration
Identifiability: The ability of the verifier to correctly determine the true quality of a response given its available context
Context Rollback: A technique where the extra context (mistake reports) used to generate a sample is removed before adding the sample to the training buffer, ensuring the policy learns to generate it independently
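
The group normalization in the GRPO entry above can be illustrated with a minimal sketch. It assumes the common formulation in which each sample's advantage is its reward minus the group mean, divided by the group standard deviation. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are illustrative assumptions, not details taken from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize verifier rewards within one sampled group.

    Assumed formulation: A_i = (r_i - mean(r)) / (std(r) + eps).
    The population standard deviation and eps are illustrative choices.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # The group itself acts as the baseline, so no value network is needed.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to the same prompt, scored 1.0 (accepted) or 0.0 (rejected).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group in which every response gets the same reward yields all-zero advantages.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

Under these assumptions, a group with no accepted response gives every sample a zero advantage and so provides no learning signal for that prompt, which is one way to see why reachability matters during exploration.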