
Teaching Large Language Models to Reason with Reinforcement Learning

Alex Havrilla, Yuqing Du, S. Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, R. Raileanu
Meta, Georgia Institute of Technology, StabilityAI, University of California, Berkeley
arXiv.org (2024)
RL Reasoning Benchmark

📝 Paper Summary

LLM Reasoning · Reinforcement Learning for LLMs
Expert Iteration outperforms PPO and other RL methods on math reasoning tasks while achieving similar sample complexity, largely because models struggle to explore beyond their SFT initialization.
Core Problem
It is unclear which RL algorithms, reward schemes, and initializations are most effective for improving LLM reasoning, or why certain methods succeed over others.
Why it matters:
  • RLHF is the dominant paradigm for alignment, but its application to complex reasoning is less understood
  • Understanding sample complexity and exploration bottlenecks is critical for scaling up reasoning capabilities efficiently
  • Supervised fine-tuning often improves greedy accuracy at the cost of solution diversity (pass@96), a trade-off that needs addressing
Concrete Example: When fine-tuning a pretrained model on math problems, PPO requires a learned value network, careful hyperparameter tuning, and substantial memory, whereas Expert Iteration simply samples solutions from the model, filters for correct answers, and fine-tunes on the survivors. The paper investigates whether PPO's added complexity yields better reasoning than this simpler baseline.
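The sample-filter-fine-tune loop of Expert Iteration can be sketched as follows. This is a toy illustration: the hypothetical `sample_answer` stochastic policy stands in for an LLM, and "fine-tuning" is modeled simply as a lower error rate on the next round, not the paper's actual Llama-2 training setup.

```python
import random

problems = [(2, 3), (4, 5), (7, 8)]  # toy "math" problems: add two ints

def answer(a, b):
    """Ground-truth checker for the toy task."""
    return a + b

def sample_answer(a, b, noise):
    """Hypothetical stochastic policy: correct with probability (1 - noise)."""
    if random.random() > noise:
        return a + b
    return a + b + random.choice([-1, 1])

def expert_iteration(rounds=3, k=16, noise=0.5):
    dataset = []
    for _ in range(rounds):
        for a, b in problems:
            # 1) Sample K candidate solutions per problem from the current policy.
            samples = [sample_answer(a, b, noise) for _ in range(k)]
            # 2) Filter: keep only samples whose final answer is correct.
            correct = [s for s in samples if s == answer(a, b)]
            dataset.extend(((a, b), s) for s in correct)
        # 3) "Fine-tune" on the filtered data; here the effect of training is
        #    modeled as a halved error rate for the next round.
        noise *= 0.5
    return dataset

data = expert_iteration()
assert all(s == a + b for (a, b), s in data)  # the filter guarantees correctness
```

Because the filter admits only correct samples, the fine-tuning set contains no wrong answers by construction; the paper's point is that this self-distillation baseline already matches PPO on deterministic reasoning tasks.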
Key Novelty
Systematic benchmarking of Expert Iteration vs. PPO for Reasoning
  • Compares Expert Iteration (EI), PPO, and Return-Conditioned RL across multiple model sizes (7B, 13B) and initializations (Pretrained, SFT)
  • Identifies that for deterministic reasoning tasks, the simpler EI method consistently matches or beats PPO
  • Attributes the lack of PPO advantage to poor exploration: models rarely generate novel correct solutions outside the distribution of their supervised fine-tuning data
Evaluation Highlights
  • Expert Iteration (EI) achieves best performance, improving Llama-2-13B greedy accuracy on GSM8K from ~46% (SFT) to 53%
  • EI achieves similar sample complexity to PPO, converging with ~10^6 samples even from a pretrained checkpoint
  • RL fine-tuning improves both maj@1 (greedy) and pass@96 (diversity) simultaneously, unlike continued SFT which degrades pass@96
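The pass@96 metric above is conventionally computed with the standard unbiased pass@k estimator (Chen et al., 2021): given n samples per problem of which c are correct, the probability that at least one of k drawn samples is correct. A minimal sketch, which may differ in detail from the paper's exact evaluation script:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), i.e. one minus
    the probability that all k drawn samples come from the n-c failures."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

# E.g., 96 samples per problem with 30 correct:
print(pass_at_k(96, 30, 1))   # pass@1 = 30/96, the per-sample accuracy
print(pass_at_k(96, 30, 96))  # pass@96 = 1.0, since some sample is correct
```

Note that pass@1 measures per-sample accuracy, whereas maj@1 in the summary refers to greedy decoding; the two are related but not identical.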
Breakthrough Assessment
7/10
Provides a rigorous, counter-intuitive finding that a simpler method (Expert Iteration) matches or beats PPO for reasoning, challenging the assumption that complex online RL is necessary for this domain.