Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, D. Precup, Feryal M. P. Behbahani, Aleksandra Faust
Google DeepMind
International Conference on Learning Representations (2024)
RL Reasoning Benchmark

📝 Paper Summary

Intrinsic Self-Correction Reinforcement Learning for Reasoning
SCoRe teaches LLMs to self-correct by using multi-turn reinforcement learning on self-generated data, employing a two-stage training process to prevent the model from collapsing into a strategy of simply generating the best first response.
Core Problem
Modern LLMs struggle to correct their own mistakes without external feedback (intrinsic self-correction), often failing to improve or even degrading correct answers during revision.
Why it matters:
  • Current self-correction methods rely on oracle feedback or separate teacher models, which are not available in real-world test settings.
  • Supervised fine-tuning (SFT) approaches suffer from distribution shift (mismatch between training data and model's own errors) or behavior collapse (learning to minimize edits rather than fix errors).
  • Achieving reliable self-correction is essential for LLMs to implement meta-strategies for complex reasoning tasks like math and coding.
Concrete Example: Asked to solve a math problem, an SFT-trained model might produce a correct first answer and then change it to an incorrect one in the second turn (behavior collapse), or fail to fix a genuine mistake because the errors in its static training data differ from the errors it actually makes (distribution shift).
Key Novelty
SCoRe (Self-Correction via Reinforcement Learning)
  • Trains on the model's own self-generated distribution of traces (on-policy) to avoid distribution mismatch seen in offline SFT.
  • Uses a two-stage training process: Stage I initializes a policy that decouples the first and second attempts by optimizing second-attempt corrections while constraining the first attempt to stay close to the base model (preventing collapse), and Stage II jointly optimizes both attempts with reward shaping.
  • Reward shaping in Stage II explicitly incentivizes 'progress' (improving from incorrect to correct) rather than just final answer correctness.
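The Stage II progress incentive described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the exact-match correctness check, and the coefficient `alpha` are all assumptions; the key idea is that the second attempt's reward includes a bonus proportional to the change in correctness between attempts.

```python
# Hedged sketch of SCoRe-style Stage II reward shaping (names illustrative,
# not from the paper's code): the second attempt earns its own correctness
# reward plus a bonus proportional to *progress* over the first attempt.

def correctness(answer: str, reference: str) -> float:
    """Binary reward: 1.0 if the answer matches the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def shaped_rewards(first: str, second: str, reference: str, alpha: float = 1.0):
    """Return (r1, r2_shaped) for a two-turn trace.

    The second-attempt reward is shaped with alpha * (r2 - r1): a large
    positive bonus for flipping an incorrect first attempt to correct, and
    a penalty for degrading a correct one -- the 'progress' incentive.
    """
    r1 = correctness(first, reference)
    r2 = correctness(second, reference)
    return r1, r2 + alpha * (r2 - r1)

# Incorrect -> correct: base reward plus the full progress bonus.
print(shaped_rewards("41", "42", "42"))  # (0.0, 2.0) with alpha=1.0
# Correct -> incorrect: shaping penalizes degrading a correct answer.
print(shaped_rewards("42", "41", "42"))  # (1.0, -1.0)
```

Under this shaping, merely repeating a correct first answer earns no bonus, so the policy cannot maximize reward by collapsing into "best first response, no edits".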
Evaluation Highlights
  • +15.6% improvement in intrinsic self-correction (delta between first and second attempt) on MATH using Gemini 1.5 Flash compared to the base model.
  • +9.1% improvement in intrinsic self-correction on HumanEval using Gemini 1.0 Pro compared to the base model.
  • Achieves positive self-correction deltas (+4.4% on MATH), whereas baselines like STaR and Self-Refine often yield negligible or negative improvement.
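The self-correction delta reported in these highlights is simply second-attempt accuracy minus first-attempt accuracy over an evaluation set. A minimal sketch (the record field names are illustrative, not from the paper):

```python
# Intrinsic self-correction delta: accuracy at attempt 2 minus accuracy at
# attempt 1. A positive delta means revision helps more than it hurts.

def self_correction_delta(records):
    """records: list of dicts with boolean 'correct_t1' and 'correct_t2'."""
    n = len(records)
    acc_t1 = sum(r["correct_t1"] for r in records) / n
    acc_t2 = sum(r["correct_t2"] for r in records) / n
    return acc_t2 - acc_t1

traces = [
    {"correct_t1": False, "correct_t2": True},   # fixed a mistake
    {"correct_t1": True,  "correct_t2": True},   # kept a correct answer
    {"correct_t1": True,  "correct_t2": False},  # degraded a correct answer
    {"correct_t1": False, "correct_t2": True},   # fixed another mistake
]
print(self_correction_delta(traces))  # 0.25: t2 accuracy 0.75 - t1 accuracy 0.5
```

A negative delta corresponds to the failure mode described in the Core Problem section, where revision degrades more answers than it repairs.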
Breakthrough Assessment
9/10
Significantly positive intrinsic self-correction results are rare in the literature. SCoRe identifies and solves the critical 'behavior collapse' failure mode of previous methods.