DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun-Mei Song, Ruoyu Zhang, R. Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, A. Liu, Bing Xue, Bing-Li Wang, Bochao Wu, B. Feng, Chengda Lu, Chenggang Zhao, C. Deng, Chenyu Zhang, C. Ruan, Damai Dai, et al.
DeepSeek-AI
Nature (2025)
Tags: RL · Reasoning · Benchmark

📝 Paper Summary

Topics: Large Language Model Reasoning · Reinforcement Learning for LLMs · Chain-of-Thought (CoT)
DeepSeek-R1 demonstrates that reasoning capabilities can emerge in LLMs via pure reinforcement learning on verifiable tasks without human-annotated supervision, and these capabilities can be distilled into smaller models.
Core Problem
Current reasoning models rely heavily on extensive human-annotated chain-of-thought data, which is hard to scale, introduces cognitive bias, and caps performance at the human level.
Why it matters:
  • Supervised fine-tuning (SFT) on human data limits models to replicating human thought processes, preventing the discovery of superior, non-human reasoning pathways
  • Obtaining high-quality, multi-step reasoning trajectories for complex tasks is resource-intensive and difficult to scale
Concrete Example: When solving a complex math problem, a standard SFT model might mimic a human's linear solution path. In contrast, DeepSeek-R1-Zero, trained via pure RL, naturally develops behaviors like backtracking ('Wait, wait. Wait. That’s an aha moment...') and self-correction without being explicitly taught these strategies.
Key Novelty
Pure RL for Emergent Reasoning (DeepSeek-R1-Zero) & Cold-Start Reinforced Distillation (DeepSeek-R1)
  • DeepSeek-R1-Zero bypasses SFT entirely, applying RL directly to a base model using rule-based verification (math/code) to incentivize the emergence of long chain-of-thought, self-reflection, and verification
  • DeepSeek-R1 refines this by using a small 'cold-start' dataset of readable CoT to fix language mixing, followed by RL, rejection sampling on the resulting reasoning traces, and a final RL stage for preference alignment
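The rule-based verification behind DeepSeek-R1-Zero's training signal can be illustrated with a small sketch. This is not the paper's actual reward code; the function names, the `\boxed{}` answer convention, and the `<think>` format check are assumptions used for illustration:

```python
import re


def math_reward(completion: str, ground_truth: str) -> float:
    """Rule-based accuracy reward: 1.0 if the final boxed answer matches
    the ground truth exactly, else 0.0. Assumes the model emits its final
    answer as \\boxed{...}; real verifiers normalize answers more carefully.
    """
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0


def format_reward(completion: str) -> float:
    """Format reward: the chain of thought must sit inside <think> tags,
    so readability can be scored separately from correctness."""
    return 1.0 if re.search(r"<think>.*</think>", completion, re.DOTALL) else 0.0
```

Because the reward depends only on a verifiable final answer (and, for code tasks, on test execution), no human-annotated reasoning traces are needed; the model is free to discover whatever chain of thought maximizes the reward.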
Evaluation Highlights
  • DeepSeek-R1 achieves 79.8% Pass@1 on AIME 2024, surpassing OpenAI-o1-1217 (79.2%) and significantly outperforming DeepSeek-V3 (39.2%)
  • On MATH-500, DeepSeek-R1 scores 97.3%, slightly exceeding OpenAI-o1-1217 (96.4%) and clearly outperforming GPT-4o (81.4%)
  • Distilled DeepSeek-R1-Distill-Llama-70B achieves 70.0% on AIME 2024, setting a new record for open-weights models and outperforming the proprietary GPT-4o-0513 (9.3%)
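The Pass@1 numbers above average correctness over several sampled completions per problem, then over problems. A minimal sketch of that estimator (the function name and input shape are illustrative, not from the paper):

```python
def pass_at_1(per_problem_correct: list[list[bool]]) -> float:
    """Pass@1: for each problem, the fraction of k sampled completions
    that are correct; then the mean of those fractions over all problems.
    Sampling k completions reduces the variance of the estimate."""
    per_problem = [sum(samples) / len(samples) for samples in per_problem_correct]
    return sum(per_problem) / len(per_problem)
```

For example, two problems with two samples each, where one problem is solved once and the other twice, give a Pass@1 of 0.75.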
Breakthrough Assessment
10/10
Proves pure RL can drive emergent reasoning (including self-verification) without SFT, matching closed-source frontier models (o1) and enabling high-performance distillation to smaller open models.