GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, avoiding the need for a separate value model
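The core of GRPO's advantage estimation can be sketched in a few lines: each output's reward is normalized against the mean and standard deviation of its group. This is a minimal illustration with a hypothetical helper name, not the paper's implementation (which also includes the clipped policy-gradient objective and a KL term):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each output's reward against its group's statistics.

    GRPO samples a group of outputs for the same input, scores them with
    an outcome reward, and uses the normalized reward as the advantage,
    so no separate value model is needed.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled outputs for one prompt, scored 1.0 (correct) or 0.0:
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct outputs in the group receive positive advantages and incorrect ones negative, by construction the advantages are centered at zero.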
DeepSeek-R1-Zero: The initial version of the model trained via pure RL on the base model without any supervised fine-tuning data
Chain-of-Thought (CoT): Intermediate reasoning steps generated by the model before the final answer
Process Reward Model: A reward model that evaluates individual steps in a reasoning chain (not used here; this paper uses outcome-based rewards)
SFT: Supervised Fine-Tuning—training on labeled input-output pairs
Cold Start Data: A small set of high-quality, human-readable reasoning examples used to initialize the model before heavy RL to ensure readability
Aha Moment: A specific point during training where the model autonomously learns to re-evaluate its approach, characterized by terms like 'Wait' or 'Let's rethink'
Language Mixing: The phenomenon where a model switches between languages (e.g., English and Chinese) within a single reasoning chain, often seen in pure RL models
Rejection Sampling: Generating many samples from a model, filtering for correct ones using a verifier, and using those as training data for a subsequent stage
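The rejection-sampling loop is simple enough to sketch. The function and the toy generator/verifier below are hypothetical stand-ins, assuming a generator callable and a verifier callable; in practice the verifier would be a rule-based checker or reward model:

```python
import random

def rejection_sample(generate, verify, prompt, n=16):
    """Draw n candidate outputs and keep only those the verifier accepts.

    The accepted (prompt, output) pairs can then serve as supervised
    fine-tuning data for a subsequent training stage.
    """
    kept = []
    for _ in range(n):
        output = generate(prompt)
        if verify(prompt, output):
            kept.append((prompt, output))
    return kept

# Toy stand-ins: a "model" that sometimes answers correctly and a
# verifier that checks the answer exactly.
random.seed(0)
gen = lambda p: random.choice(["4", "5"])
ver = lambda p, o: o == "4"
data = rejection_sample(gen, ver, "2+2=?", n=8)
```

Everything that survives the filter is correct by the verifier's standard, which is what makes the resulting dataset usable for the next stage.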
MoE: Mixture-of-Experts—a model architecture where different sub-networks (experts) are activated for different inputs
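A minimal sketch of the routing idea behind MoE, under simplifying assumptions (scalar expert outputs, a plain dot-product gate, dense top-k selection); real MoE layers use learned neural gates and expert sub-networks:

```python
import math

def moe_forward(x, experts, gate_weights, k=2):
    """Route an input to the top-k experts by gate score and mix their outputs.

    Only k of the experts run per input, which is how MoE models keep
    per-token compute low relative to total parameter count.
    """
    # Gate score for each expert: dot product of its gate vector with x.
    scores = [sum(w * xi for w, xi in zip(gw, x)) for gw in gate_weights]
    topk = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected experts' scores gives the mixing weights.
    exps = [math.exp(scores[i]) for i in topk]
    total = sum(exps)
    return sum((e / total) * experts[i](x) for e, i in zip(exps, topk))

# Three toy experts; only the top 2 by gate score contribute.
out = moe_forward(
    x=[0.0, 1.0],
    experts=[lambda x: 1.0, lambda x: 2.0, lambda x: 3.0],
    gate_weights=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    k=2,
)
```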
Pass@1: The percentage of problems for which the model's first sampled answer is correct
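In its simplest form, the metric reduces to a fraction of exact matches. A minimal sketch with a hypothetical function name, assuming one sampled answer per problem and exact-match grading:

```python
def pass_at_1(first_answers, references):
    """Fraction of problems whose first sampled answer matches the reference."""
    correct = sum(a == r for a, r in zip(first_answers, references))
    return correct / len(references)

# 2 of 3 first answers match the references.
score = pass_at_1(["4", "9", "7"], ["4", "9", "6"])
```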