
A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xi-Dai Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Hua-yong Chen, Xiaoye Qu, Yafu Li, Weize Chen, et al.
Tsinghua University, Shanghai AI Laboratory, Shanghai Jiao Tong University
arXiv.org (2025)
Tags: RL · Reasoning · Agent · MM · Pretraining · Benchmark

📝 Paper Summary

Topics: Reinforcement Learning (RL) for LLMs · Large Reasoning Models (LRMs) · Reward Design
This survey systematically reviews the shift in Reinforcement Learning for LLMs from human alignment (RLHF) to capability enhancement via Reinforcement Learning with Verifiable Rewards (RLVR), identifying verifiable rewards and test-time compute scaling as the key drivers of recent reasoning gains.
Core Problem
Traditional RLHF focuses on aligning models with human preferences (helpfulness/harmlessness) but often fails to significantly boost complex reasoning capabilities in math and coding.
Why it matters:
  • Pre-training scaling laws (more data/parameters) are hitting diminishing returns; RL offers a new scaling axis via test-time compute.
  • Prior RL methods relying on learned reward models suffer from reward hacking and lack robustness in objective domains like math.
  • The emergence of models like OpenAI o1 and DeepSeek-R1 proves RL can induce self-correction and planning, but the methodology is fragmented across recent papers.
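The "new scaling axis" of test-time compute has a simple canonical form: best-of-N sampling, where more samples per problem buy more chances to find a solution a verifier accepts. A minimal sketch (hypothetical; `sample_fn` and `score_fn` stand in for a model call and a verifier, and are not from the survey):

```python
from typing import Callable

def best_of_n(sample_fn: Callable[[], str],
              score_fn: Callable[[str], float],
              n: int) -> str:
    """More test-time compute (larger n) -> more chances to draw a
    high-scoring candidate, independent of model size."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=score_fn)

# Toy illustration: candidates streamed from a fixed list; the verifier
# rewards only the correct answer, so best-of-4 recovers it.
candidates = iter(["41", "40", "42", "40"])
pick = best_of_n(lambda: next(candidates),
                 lambda s: 1.0 if s == "42" else 0.0,
                 n=4)
# pick == "42"
```

Scaling n trades inference compute for accuracy, which is the smooth test-time improvement curve the survey attributes to models like o1.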
Concrete Example: In standard RLHF, a model might learn to produce polite but incorrect math answers because human labelers prefer the tone. In RLVR (e.g., DeepSeek-R1), the model is penalized unless the final answer matches the ground truth, forcing it to develop 'thinking' processes like self-verification to maximize the reward.
Key Novelty
Comprehensive Taxonomy of RL for Large Reasoning Models
  • Categorizes the field into foundational components: Reward Design (Verifiable vs. Generative), Policy Optimization (Critic vs. Critic-Free), and Sampling Strategies.
  • Distinguishes between 'Sharpening' (enhancing existing knowledge) and 'Discovery' (learning new capabilities), arguing RL currently excels at the former.
  • Formulates 'Verifier's Law': the ease of training AI systems is proportional to the degree to which the task is objectively verifiable.
Evaluation Highlights
  • DeepSeek-R1 (671B) matches OpenAI o1 performance on math/code benchmarks using Group Relative Policy Optimization (GRPO) with rule-based rewards.
  • OpenAI o1 performance improves smoothly with both increased train-time RL compute and test-time 'thinking' compute.
  • Kimi K2 (1T parameters) scales agentic training data synthesis using a general RL procedure for non-verifiable rewards.
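The GRPO update mentioned above is critic-free: instead of a learned value baseline, it normalizes each completion's reward against the mean and standard deviation of a group of completions sampled for the same prompt. A minimal sketch of the advantage computation (group size and rewards are illustrative, not from the paper):

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Group-relative advantages: center and scale rewards within a group
    of completions for one prompt, so no learned critic is needed."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four completions for one prompt: two correct (reward 1), two wrong (reward 0).
# Correct completions get positive advantage, wrong ones negative.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Paired with the rule-based rewards above, this is the recipe the survey credits for DeepSeek-R1's o1-level math/code results without a separate value network.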
Breakthrough Assessment
9/10
This is a timely and exhaustive survey capturing a major paradigm shift in LLM training (Post-Training Scaling) triggered by o1 and DeepSeek-R1. It defines the vocabulary and taxonomy for the next phase of LLM research.