GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, avoiding the need for a separate value model
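The core of GRPO's advantage estimation can be sketched in a few lines: each output's reward is normalized against the mean and standard deviation of its group. This is a minimal illustration with a hypothetical helper name, not the paper's implementation (which also includes the clipped policy-gradient objective and a KL term):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each output's reward against its group's statistics.

    GRPO samples a group of outputs for the same input, scores them with
    an outcome reward, and uses the normalized reward as the advantage,
    so no separate value model is needed.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled outputs for one prompt, scored 1.0 (correct) or 0.0:
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct outputs in the group receive positive advantages and incorrect ones negative, by construction the advantages are centered at zero.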
DeepSeek-R1-Zero: The initial version of the model trained via pure RL on the base model without any supervised fine-tuning data
Chain-of-Thought (CoT): Intermediate reasoning steps generated by the model before the final answer
Process Reward Model: A reward model that evaluates individual steps in a reasoning chain (not used here; this paper uses outcome-based rewards)
SFT: Supervised Fine-Tuning—training on labeled input-output pairs
Cold Start Data: A small set of high-quality, human-readable reasoning examples used to initialize the model before heavy RL to ensure readability
Aha Moment: A specific point during training where the model autonomously learns to re-evaluate its approach, characterized by terms like 'Wait' or 'Let's rethink'
Language Mixing: The phenomenon where a model switches between languages (e.g., English and Chinese) within a single reasoning chain, often seen in pure RL models
Rejection Sampling: Generating many samples from a model, filtering for correct ones using a verifier, and using those as training data for a subsequent stage
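The rejection-sampling loop is simple enough to sketch. The function and the toy generator/verifier below are hypothetical stand-ins, assuming a generator callable and a verifier callable; in practice the verifier would be a rule-based checker or reward model:

```python
import random

def rejection_sample(generate, verify, prompt, n=16):
    """Draw n candidate outputs and keep only those the verifier accepts.

    The accepted (prompt, output) pairs can then serve as supervised
    fine-tuning data for a subsequent training stage.
    """
    kept = []
    for _ in range(n):
        output = generate(prompt)
        if verify(prompt, output):
            kept.append((prompt, output))
    return kept

# Toy stand-ins: a "model" that sometimes answers correctly and a
# verifier that checks the answer exactly.
random.seed(0)
gen = lambda p: random.choice(["4", "5"])
ver = lambda p, o: o == "4"
data = rejection_sample(gen, ver, "2+2=?", n=8)
```

Everything that survives the filter is correct by the verifier's standard, which is what makes the resulting dataset usable for the next stage.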
MoE: Mixture-of-Experts—a model architecture where different sub-networks (experts) are activated for different inputs
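A minimal sketch of the routing idea behind MoE, under simplifying assumptions (scalar expert outputs, a plain dot-product gate, dense top-k selection); real MoE layers use learned neural gates and expert sub-networks:

```python
import math

def moe_forward(x, experts, gate_weights, k=2):
    """Route an input to the top-k experts by gate score and mix their outputs.

    Only k of the experts run per input, which is how MoE models keep
    per-token compute low relative to total parameter count.
    """
    # Gate score for each expert: dot product of its gate vector with x.
    scores = [sum(w * xi for w, xi in zip(gw, x)) for gw in gate_weights]
    topk = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected experts' scores gives the mixing weights.
    exps = [math.exp(scores[i]) for i in topk]
    total = sum(exps)
    return sum((e / total) * experts[i](x) for e, i in zip(exps, topk))

# Three toy experts; only the top 2 by gate score contribute.
out = moe_forward(
    x=[0.0, 1.0],
    experts=[lambda x: 1.0, lambda x: 2.0, lambda x: 3.0],
    gate_weights=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    k=2,
)
```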
Pass@1: The percentage of problems for which the model's first sampled answer is correct
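In its simplest form, the metric reduces to a fraction of exact matches. A minimal sketch with a hypothetical function name, assuming one sampled answer per problem and exact-match grading:

```python
def pass_at_1(first_answers, references):
    """Fraction of problems whose first sampled answer matches the reference."""
    correct = sum(a == r for a, r in zip(first_answers, references))
    return correct / len(references)

# 2 of 3 first answers match the references.
score = pass_at_1(["4", "9", "7"], ["4", "9", "6"])
```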