
A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xi-Dai Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Hua-yong Chen, Xiaoye Qu, Yafu Li, Weize Chen, et al.
Tsinghua University, Shanghai AI Laboratory, Shanghai Jiao Tong University
arXiv.org (2025)
Tags: RL · Reasoning · Agent · MM · Pretraining · Benchmark

📝 Paper Summary

Topics: Reinforcement Learning (RL) for LLMs · Large Reasoning Models (LRMs) · Reward Design
This survey systematically reviews the shift in Reinforcement Learning for LLMs from human alignment (RLHF) to capability enhancement via Reinforcement Learning with Verifiable Rewards (RLVR), identifying verifiable rewards and test-time compute scaling as the key drivers of recent reasoning gains.
Core Problem
Traditional RLHF focuses on aligning models with human preferences (helpfulness/harmlessness) but often fails to significantly boost complex reasoning capabilities in math and coding.
Why it matters:
  • Pre-training scaling laws (more data/parameters) are hitting diminishing returns; RL offers a new scaling axis via test-time compute.
  • Prior RL methods relying on learned reward models suffer from reward hacking and lack robustness in objective domains like math.
  • The emergence of models like OpenAI o1 and DeepSeek-R1 proves RL can induce self-correction and planning, but the methodology is fragmented across recent papers.
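The "new scaling axis" of test-time compute has a simple canonical form: best-of-N sampling, where more samples per problem buy more chances to find a solution a verifier accepts. A minimal sketch (hypothetical; `sample_fn` and `score_fn` stand in for a model call and a verifier, and are not from the survey):

```python
from typing import Callable

def best_of_n(sample_fn: Callable[[], str],
              score_fn: Callable[[str], float],
              n: int) -> str:
    """More test-time compute (larger n) -> more chances to draw a
    high-scoring candidate, independent of model size."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=score_fn)

# Toy illustration: candidates streamed from a fixed list; the verifier
# rewards only the correct answer, so best-of-4 recovers it.
candidates = iter(["41", "40", "42", "40"])
pick = best_of_n(lambda: next(candidates),
                 lambda s: 1.0 if s == "42" else 0.0,
                 n=4)
# pick == "42"
```

Scaling n trades inference compute for accuracy, which is the smooth test-time improvement curve the survey attributes to models like o1.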
Concrete Example: In standard RLHF, a model might learn to produce polite but incorrect math answers because human labelers prefer the tone. In RLVR (e.g., DeepSeek-R1), the model is penalized unless the final answer matches the ground truth, forcing it to develop 'thinking' processes like self-verification to maximize the reward.
Key Novelty
Comprehensive Taxonomy of RL for Large Reasoning Models
  • Categorizes the field into foundational components: Reward Design (Verifiable vs. Generative), Policy Optimization (Critic vs. Critic-Free), and Sampling Strategies.
  • Distinguishes between 'Sharpening' (enhancing existing knowledge) and 'Discovery' (learning new capabilities), arguing RL currently excels at the former.
  • Formulates 'Verifier's Law': the ease of training AI systems is proportional to the degree to which the task is objectively verifiable.
Evaluation Highlights
  • DeepSeek-R1 (671B) matches OpenAI o1 performance on math/code benchmarks using Group Relative Policy Optimization (GRPO) with rule-based rewards.
  • OpenAI o1 performance improves smoothly with both increased train-time RL compute and test-time 'thinking' compute.
  • Kimi K2 (1T parameters) scales agentic training data synthesis using a general RL procedure for non-verifiable rewards.
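The GRPO update mentioned above is critic-free: instead of a learned value baseline, it normalizes each completion's reward against the mean and standard deviation of a group of completions sampled for the same prompt. A minimal sketch of the advantage computation (group size and rewards are illustrative, not from the paper):

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Group-relative advantages: center and scale rewards within a group
    of completions for one prompt, so no learned critic is needed."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four completions for one prompt: two correct (reward 1), two wrong (reward 0).
# Correct completions get positive advantage, wrong ones negative.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Paired with the rule-based rewards above, this is the recipe the survey credits for DeepSeek-R1's o1-level math/code results without a separate value network.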
Breakthrough Assessment
9/10
This is a timely and exhaustive survey capturing a major paradigm shift in LLM training (Post-Training Scaling) triggered by o1 and DeepSeek-R1. It defines the vocabulary and taxonomy for the next phase of LLM research.