Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle

K Liu, D Yang, Z Qian, W Yin, Y Wang, H Li, J Liu…
arXiv, September 2025
RL · Reasoning · Pretraining · Benchmark

📝 Paper Summary

Reinforcement Learning for LLMs · LLM Alignment · Reasoning
This survey provides a comprehensive lifecycle review of reinforcement learning in Large Language Models, emphasizing the emerging paradigm of Reinforcement Learning with Verifiable Rewards (RLVR) to enhance reasoning capabilities.
Core Problem
Existing surveys on RL for LLMs are often limited in scope, focusing primarily on alignment (RLHF) while overlooking the role of RL in pre-training and, crucially, recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning.
Why it matters:
  • LLMs struggle with complex reasoning and can produce misleading outputs despite general capabilities
  • There is no consensus on how to apply RL across the full LLM lifecycle, from pre-training through post-training to inference
  • Practical design decisions for RLVR (data curation, reward definitions) remain scattered and unorganized in current literature
Concrete Example: Current LLMs often fail at multi-step mathematical reasoning because they receive no objective feedback during training. RLVR addresses this by rewarding the model only when it produces a solution that passes a programmatic check (e.g., a unit test or a formal proof verifier), pushing the model to self-correct until a verifiable result is reached.
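The verifiable-reward idea above can be sketched as a binary reward function over unit tests. This is an illustrative minimal sketch, not the survey's or any specific system's implementation; the names `verifiable_reward`, `candidate_code`, and `tests` are assumptions, and production RLVR pipelines would sandbox the execution rather than call `exec` directly.

```python
def verifiable_reward(candidate_code: str, tests: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the candidate passes every unit test."""
    env: dict = {}
    try:
        exec(candidate_code, env)  # define the model-generated solution
        exec(tests, env)           # run assert-based checks against it
    except Exception:              # syntax error, runtime error, or failed assert
        return 0.0
    return 1.0

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```

In a training loop, this scalar would feed a policy-gradient update (e.g., PPO or GRPO); the key property is that the reward comes from an objective checker rather than a learned preference model.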
Key Novelty
Lifecycle-based Taxonomy with RLVR Focus
  • Organizes RL methods into a full lifecycle framework: Pre-training, Alignment Fine-tuning, and Reinforced Reasoning
  • Specifically highlights Reinforcement Learning with Verifiable Rewards (RLVR) as a distinct and critical phase for advancing reasoning capabilities beyond standard RLHF
  • Integrates a review of datasets, benchmarks, and open-source frameworks specifically tailored for these RL stages
Evaluation Highlights
  • DeepSeek-R1-Zero achieves 71.0% pass@1 on AIME 2024, surpassing the 2.6% of DeepSeek-V3-Base
  • Qwen2.5-Math-7B-Instruct achieves 95.2% on GSM8K using RLVR techniques, outperforming the base Qwen2.5-7B's 79.8%
  • OpenAI o1 achieves 83.3% pass@1 on AIME 2024, compared to GPT-4o's 13.4%
Breakthrough Assessment
8/10
A timely and necessary survey that systematizes the rapidly evolving field of RLVR, connecting it to the broader history of RLHF and pre-training RL.