RLVR: Reinforcement Learning with Verifiable Rewards—a paradigm where the reward signal comes from an objective, automated check (e.g., executing generated code against unit tests, or verifying a math answer) rather than from a learned reward model
RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences
PPO: Proximal Policy Optimization—a policy gradient algorithm that updates the policy in small, constrained steps to ensure stability
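The PPO entry above centers on the clipped surrogate objective, which caps how much a single update can move the policy. A minimal sketch for one action (the function name and default epsilon are illustrative; real implementations batch this over trajectories):

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate for a single action.

    ratio = pi_new(a|s) / pi_old(a|s). Clipping the ratio to
    [1 - epsilon, 1 + epsilon] and taking the min keeps the update
    pessimistic, so large policy shifts earn no extra objective value.
    """
    clipped_ratio = max(min(ratio, 1.0 + epsilon), 1.0 - epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio of 1.5 is clipped to 1.2, so the objective stops rewarding further movement in that direction; with a negative advantage, the min picks the more pessimistic (lower) value.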
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same prompt, eliminating the need for a separate value network
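GRPO's group-relative advantage estimate can be sketched in a few lines: score each of the group's sampled outputs, then standardize the rewards within the group (function name and the small epsilon guard are illustrative):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled output relative to its group.

    Standardizes rewards against the group mean and standard deviation,
    replacing the per-state value estimate a critic network would provide.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group with binary verifiable rewards such as [1, 0, 1, 0], correct outputs get advantage ≈ +1 and incorrect ones ≈ −1, with no value network involved.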
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
DQN: Deep Q-Network—a value-based RL algorithm that uses a neural network to approximate the Q-value function
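The Q-network in DQN is trained to regress toward the Bellman target for each transition. A minimal sketch of that target, assuming the next state's Q-values have already been computed (names and the default discount are illustrative):

```python
def dqn_target(reward, next_q_values, gamma=0.99, done=False):
    """Bellman target for one transition: r + gamma * max_a' Q(s', a').

    If the episode terminated, there is no bootstrap term and the
    target is just the reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```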
AIME: American Invitational Mathematics Examination—a challenging math competition benchmark used to evaluate reasoning capabilities
GSM8K: Grade School Math 8K—a dataset of grade school math word problems
SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs) before applying RL
REINFORCE: A fundamental Monte Carlo policy gradient method that estimates the gradient of expected return by weighting action log-probabilities with the returns observed in sampled trajectories
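In practice REINFORCE is implemented as a surrogate loss whose gradient matches the policy-gradient estimate: the negative sum of log-probabilities weighted by (optionally baseline-adjusted) returns. A minimal sketch, with illustrative names:

```python
def reinforce_loss(log_probs, returns, baseline=0.0):
    """Monte Carlo policy-gradient surrogate loss.

    log_probs: log pi(a_t | s_t) for each sampled action.
    returns:   the return G_t following each action.
    Minimizing this loss ascends the expected return; subtracting a
    baseline reduces variance without biasing the gradient.
    """
    return -sum(lp * (g - baseline) for lp, g in zip(log_probs, returns))
```

Autodiff frameworks differentiate this loss with respect to the policy parameters that produced `log_probs`, recovering the REINFORCE update.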
TRPO: Trust Region Policy Optimization—a policy gradient method that constrains each update to a KL-divergence trust region around the current policy, preventing performance collapse from overly large steps
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
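For discrete distributions, KL divergence is the expectation under P of the log-ratio log(P/Q). A minimal sketch (illustrative name; note the measure is asymmetric, so KL(P‖Q) generally differs from KL(Q‖P)):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists.

    Terms with p_i == 0 contribute nothing by the convention
    0 * log(0/q) = 0; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

KL is zero exactly when the two distributions match, which is why RLHF-style objectives use it as a penalty keeping the fine-tuned policy close to a reference model.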