GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of outputs generated from the same input to reduce variance
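A minimal sketch of the group-relative normalization described above, assuming binary rewards and a small epsilon to avoid division by zero (the function name and shapes are illustrative, not GRPO's actual implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within a group of outputs sampled from the same input.

    Each output's advantage is its reward minus the group mean, divided by the
    group standard deviation; this centers the signal and reduces variance.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four sampled outputs for one prompt: two correct (reward 1), two incorrect (0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Correct outputs get a positive advantage and incorrect ones a negative advantage of equal magnitude, so the policy gradient pushes probability mass from failures toward successes within the group.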
Pass@1: The fraction of problems the model solves with a single generated sample; equivalently, its expected success rate when given one attempt per problem
Pass@K: The probability that at least one of K generated samples is correct
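Pass@K is usually computed with the standard unbiased estimator: draw n samples per problem, count the c correct ones, and estimate the chance that a random size-K subset contains at least one success. A short sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimator given n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a uniformly
    random subset of k of the n samples contains at least one correct answer.
    """
    if n - c < k:
        # Fewer incorrect samples than k: every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 correct: Pass@1 equals the raw success rate c/n,
# while Pass@5 is substantially higher.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k = 1 the formula reduces to c/n, matching the Pass@1 definition above; as k grows toward n it approaches 1 whenever at least one sample is correct, which is why very large K probes the coverage wall.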
Temperature Distillation: The phenomenon where RL makes a model's performance robust to high sampling temperatures, flattening the precision curve
Coverage Wall: The limit of unique problems a model can solve even with infinite sampling (Pass@K as K approaches infinity), which RL fails to expand
Self-difficulty sorting: Ranking test problems based on the model's own precision (success rate) on them, rather than external difficulty labels
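A possible implementation of this ranking, assuming per-problem success counts over a fixed sample budget (the counts and problem names below are hypothetical):

```python
def sort_by_self_difficulty(success_counts, num_samples):
    """Rank problems from easiest to hardest by the model's own precision.

    precision = fraction of the model's sampled answers that were correct;
    no external difficulty labels are consulted.
    """
    precision = {p: c / num_samples for p, c in success_counts.items()}
    return sorted(precision, key=precision.get, reverse=True)

# Hypothetical counts: correct answers out of 16 samples per problem.
counts = {"p1": 12, "p2": 2, "p3": 16}
print(sort_by_self_difficulty(counts, 16))  # ['p3', 'p1', 'p2']
```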
Plan Grade: The fraction of generated solutions that contain the correct sequence of high-level steps (approach) required to solve the problem
Execution Grade: The fraction of solutions with a correct plan that are also carried through correctly, arriving at the right final answer
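The two grades above can be computed from per-solution annotations; because the Execution Grade is conditioned on the plan being correct, their product recovers the overall success rate. A sketch under the assumption that each solution carries boolean `plan_ok` and `answer_ok` labels (both names are illustrative):

```python
def plan_and_execution_grades(solutions):
    """Compute Plan Grade and Execution Grade from annotated solutions.

    Plan Grade: fraction of all solutions whose high-level approach is correct.
    Execution Grade: among plan-correct solutions, fraction whose final answer
    is also correct. Plan Grade * Execution Grade = overall success rate.
    """
    n = len(solutions)
    plan_ok = [s for s in solutions if s["plan_ok"]]
    exec_ok = [s for s in plan_ok if s["answer_ok"]]
    plan_grade = len(plan_ok) / n
    execution_grade = len(exec_ok) / len(plan_ok) if plan_ok else 0.0
    return plan_grade, execution_grade

# Hypothetical annotations for four sampled solutions to one problem:
# three have the right approach, two of those reach the right answer.
sols = [
    {"plan_ok": True,  "answer_ok": True},
    {"plan_ok": True,  "answer_ok": False},
    {"plan_ok": True,  "answer_ok": True},
    {"plan_ok": False, "answer_ok": False},
]
print(plan_and_execution_grades(sols))
```

Here the Plan Grade is 3/4 and the Execution Grade is 2/3, so the overall success rate is 3/4 × 2/3 = 1/2, matching the two correct answers out of four samples.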
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer