Expert Iteration: An iterative algorithm in which the model generates samples, the correct ones are kept via rejection sampling, and the model is fine-tuned on those correct samples
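One round of the loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `toy_policy` stand-in and the exact-match correctness check are assumptions for the example.

```python
def expert_iteration_round(policy, prompts, answers, n_samples=8):
    """One round of expert iteration: sample from the model, keep only
    correct samples (rejection sampling), return them as SFT data."""
    sft_data = []
    for prompt, gold in zip(prompts, answers):
        for _ in range(n_samples):
            completion = policy(prompt)   # sample a candidate solution
            if completion == gold:        # filter: keep correct ones only
                sft_data.append((prompt, completion))
    return sft_data  # in practice: fine-tune on this set, then repeat

# Hypothetical stand-in for an LLM policy, for illustration only.
toy_policy = lambda prompt: "2"
data = expert_iteration_round(toy_policy, ["1+1="], ["2"])
```

A full run would alternate this sampling/filtering step with supervised fine-tuning on the collected data.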
PPO: Proximal Policy Optimization—an online RL algorithm that updates a policy while limiting how much it changes from the previous version to ensure stability
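The "limited change" in PPO comes from its clipped surrogate objective. A per-token sketch (the full objective also adds value-function and entropy terms, which are omitted here):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one token: caps how much the probability
    ratio between the new and old policy can move the update."""
    ratio = math.exp(logp_new - logp_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return min(unclipped, clipped)          # pessimistic (lower) bound
```

When the ratio drifts outside [1 - eps, 1 + eps], the clipped branch caps the objective, so the gradient stops encouraging further movement away from the old policy.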
SFT: Supervised Fine-Tuning—training a model on a fixed dataset of correct examples using standard cross-entropy loss
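Concretely, the SFT objective is the average negative log-likelihood of the gold tokens:

```python
import math

def sft_loss(token_logprobs):
    """Cross-entropy of the gold tokens, averaged over the sequence:
    the standard SFT objective (negative log-likelihood)."""
    return -sum(token_logprobs) / len(token_logprobs)
```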
maj@1: Accuracy when checking the model's single greedily decoded output
pass@96: The probability that at least one solution is correct when sampling 96 times from the model
maj@96: Accuracy when sampling 96 times and taking the majority vote of the final answers
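The three sampling-based metrics above can be computed from a set of model samples. A minimal sketch: maj@k is a majority vote over final answers, and pass@k uses the standard unbiased estimator (given n samples of which c are correct).

```python
from collections import Counter
from math import comb

def maj_at_k(answers, correct):
    """maj@k: majority vote over the k sampled final answers."""
    winner, _ = Counter(answers).most_common(1)[0]
    return winner == correct

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n, c of them correct,
    is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For pass@96 one would typically draw n >= 96 samples and evaluate `pass_at_k(n, c, 96)`.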
rerank@96: Accuracy when sampling 96 times and selecting the best answer using a trained reward model (ORM)
ORM: Outcome-Based Reward Model—a model trained to predict if a partial or full solution will lead to a correct answer
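Reranking with an ORM reduces to scoring each sample and keeping the best one. A minimal sketch; the dict-backed `reward_model` below is a hypothetical stand-in for a trained ORM that scores (question, solution) pairs:

```python
def rerank(samples, reward_model):
    """rerank@k: score each sampled solution with a reward model and
    return the highest-scoring one."""
    return max(samples, key=reward_model)

# Hypothetical ORM scores for three sampled solutions (illustration only).
scores = {"x=2": 0.9, "x=3": 0.4, "x=5": 0.1}
best = rerank(list(scores), reward_model=scores.get)
```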
RCRL: Return-Conditioned RL—training a model conditioned on a desired return (reward) token, then prompting it with the high-reward token at inference
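In RCRL, conditioning is implemented purely through the data format. A minimal sketch; the `<|good|>` / `<|bad|>` token names are hypothetical, chosen for illustration:

```python
GOOD, BAD = "<|good|>", "<|bad|>"  # hypothetical reward-condition tokens

def make_rcrl_example(question, solution, is_correct):
    """Training example: prepend a token encoding the observed return."""
    tag = GOOD if is_correct else BAD
    return f"{tag} {question} {solution}"

def rcrl_prompt(question):
    """Inference prompt: condition on the high-reward token so the model
    imitates its correct-solution distribution."""
    return f"{GOOD} {question}"
```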
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small adapter matrices
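The adapter math can be sketched in a few lines of NumPy. This shows only one linear layer; the sizes and scaling factor below are illustrative, not values from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16              # hidden size, LoRA rank, scaling

W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-init
                                    # so the adapter starts as a no-op

def lora_forward(x):
    """y = W x + (alpha / r) * B A x; only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))
```

Because B starts at zero, the adapted model initially matches the frozen base model exactly, and training only updates the small A and B matrices.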
KL penalty: A regularizer used in RL to prevent the trained policy from diverging too far from a reference model
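In RLHF-style training the KL penalty is usually folded into the per-token reward. A minimal sketch, using the common per-token log-ratio approximation of the KL term:

```python
def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Per-token shaped reward: subtract beta times the log-probability
    ratio between the trained policy and the frozen reference model,
    penalizing tokens where the policy has drifted from the reference."""
    return reward - beta * (logp_policy - logp_ref)
```

When the policy assigns a token higher log-probability than the reference does, the penalty reduces the reward, pulling the policy back toward the reference model.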