Zero RL training: Reinforcement learning applied directly to a pre-trained base model without an intermediate supervised fine-tuning (SFT) stage
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs generated from the same prompt to reduce variance, eliminating the need for a separate value function critic
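The group-relative normalization at the heart of GRPO can be sketched in a few lines. This is a minimal illustration, not the full algorithm: `group_relative_advantages` is a hypothetical helper assuming per-sample scalar rewards for several completions of the same prompt, with advantages computed as the reward minus the group mean, divided by the group standard deviation.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within a group of outputs sampled from the
    same prompt: A_i = (r_i - mean(r)) / (std(r) + eps).

    The group mean serves as the baseline, so no learned value
    function (critic) is needed; eps guards against a zero std when
    all rewards in the group are identical.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    eps = 1e-8
    return [(r - mean) / (std + eps) for r in rewards]

# e.g., four sampled answers to one prompt, scored 1.0 if correct else 0.0;
# correct answers get positive advantages, incorrect ones negative
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```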
aha moment: The point during training where a model spontaneously exhibits advanced reasoning behaviors like self-verification or backtracking without being explicitly taught them
pass@k: A metric measuring the probability that at least one correct answer is found in k generated samples
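One common way to compute pass@k without high-variance direct sampling is the combinatorial estimator: generate n samples per problem, count the c correct ones, and compute the probability that a random size-k subset contains at least one correct sample. A sketch, assuming that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c are correct:
    1 - C(n - c, k) / C(n, k), i.e. one minus the probability that a
    random subset of k samples contains no correct answer.
    """
    if n - c < k:
        # fewer than k incorrect samples exist, so every size-k
        # subset must contain at least one correct answer
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # for k=1 this reduces to c/n
print(pass_at_k(n=10, c=3, k=5))
```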
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training a model on demonstrated examples (input-output pairs) to teach it specific behaviors or formats
format reward: A reward signal given specifically for adhering to a structural constraint (e.g., enclosing the answer in \boxed{}) rather than correctness
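A format reward is typically implemented as a simple pattern check on the model's output. A minimal sketch, assuming the \boxed{} convention mentioned above (the function name and binary 0/1 reward values are illustrative choices, not a fixed standard):

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion encloses an answer in \\boxed{...},
    else 0.0 -- this rewards adherence to the output format only and
    says nothing about whether the boxed answer is correct."""
    return 1.0 if re.search(r"\\boxed\{[^}]*\}", completion) else 0.0

print(format_reward("The answer is \\boxed{42}."))  # 1.0
print(format_reward("The answer is 42."))           # 0.0
```

In practice such a format term is often combined with a correctness reward, so the model is nudged toward parseable output even before it starts answering correctly.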
cold start: Initializing the RL training process with a model that has already undergone SFT, rather than the raw base model