GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, avoiding a separate value function
Pass@1: The percentage of problems for which the model's single generated answer is correct
Pass@k: The probability that at least one of k generated samples is correct
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training on a fixed dataset of input-output pairs
PPO: Proximal Policy Optimization—a standard RL algorithm using a clipped objective to ensure stable policy updates
R1-Zero: A paradigm for training reasoning models via RL on base models without supervised fine-tuning data, relying on self-evolution
Shaping function: A mechanism that reweights per-token gradients, assigning higher importance to tokens in successful refinements to which the current policy assigns low probability
Eluder dimension: A measure of the complexity of a hypothesis space, used here to analyze the sample efficiency of learning
Critic/Critique: In this paper, a natural language explanation of why an answer is correct or incorrect, distinct from a scalar reward value
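To make the GRPO entry concrete, here is a minimal sketch of group-relative advantage estimation: each sampled output's reward is normalized by the mean and standard deviation of its group, so no learned value function is needed. The function name and exact normalization are illustrative; real implementations typically operate on tensors and may add a small epsilon to the denominator.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages for one group of outputs sharing an input:
    normalize each reward by the group's mean and standard deviation,
    replacing the value-function baseline used in PPO."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    if std == 0.0:  # all rewards equal: no relative signal in this group
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

With binary correctness rewards such as `[1.0, 0.0, 0.0, 1.0]`, the correct outputs receive positive advantages and the incorrect ones negative, summing to zero across the group.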
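The Pass@k entry is commonly estimated with the standard unbiased estimator used in code-generation evaluation: draw n samples, count c correct, and compute 1 - C(n-c, k)/C(n, k), the probability that a random subset of k samples contains at least one correct answer. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n total samples with c correct:
    the probability that at least one of k drawn samples is correct."""
    if n - c < k:  # too few incorrect samples to fill k slots: always passes
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=4 samples of which c=2 are correct, pass@1 is 0.5 and pass@2 is 5/6.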