RLVR: Reinforcement Learning with Verifiable Rewards—a training method where models improve by receiving feedback based on the objective correctness of their answers
Lucky Guess: A scenario where a reasoning model arrives at the correct final answer despite using incorrect logic, formulas, or derivation steps
Consensus Score: A metric used during dataset construction, defined as the average agreement rate of a proxy verifier across multiple trials; samples with low consensus are flagged as 'Hard-to-Verify'
Process-Outcome Alignment: The requirement that a correct final answer must be the result of a logically valid derivation process
AIME: American Invitational Mathematics Examination—a challenging math competition used as a benchmark for advanced reasoning capabilities
GPQA: Graduate-Level Google-Proof Q&A—a difficult graduate-level science benchmark whose questions are designed to resist answering via simple web search
SFT: Supervised Fine-Tuning—training a model on a dataset of correct examples before applying reinforcement learning
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm used to update the model's policy based on reward signals
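The RLVR reward and Consensus Score entries above can be illustrated with a minimal sketch. The function names, the exact-match check, the boolean voting scheme, and the 0.7 threshold below are all illustrative assumptions, not details from the source:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """RLVR-style reward: 1.0 if the final answer matches the verifiable
    ground truth, else 0.0. (Exact string matching is an assumption; real
    verifiers may normalize or check symbolic equivalence.)"""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0


def consensus_score(verifier_votes: list[bool]) -> float:
    """Average agreement rate of a proxy verifier across multiple trials:
    the fraction of trials in which the verifier accepted the answer."""
    return sum(verifier_votes) / len(verifier_votes)


def is_hard_to_verify(verifier_votes: list[bool],
                      threshold: float = 0.7) -> bool:
    """Flag a sample as 'Hard-to-Verify' when the proxy verifier's
    agreement falls below a threshold (0.7 is purely illustrative)."""
    return consensus_score(verifier_votes) < threshold
```

For example, a sample on which the verifier agrees in only 2 of 4 trials has a consensus score of 0.5 and would be flagged as Hard-to-Verify under this sketch.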