Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, D. Precup, Feryal M. P. Behbahani, Aleksandra Faust
Google DeepMind
International Conference on Learning Representations (2024)
RL Reasoning Benchmark

📝 Paper Summary

Intrinsic Self-Correction Reinforcement Learning for Reasoning
SCoRe teaches LLMs to self-correct by using multi-turn reinforcement learning on self-generated data, employing a two-stage training process to prevent the model from collapsing into a strategy of simply generating the best first response.
Core Problem
Modern LLMs struggle to correct their own mistakes without external feedback (intrinsic self-correction), often failing to improve or even degrading correct answers during revision.
Why it matters:
  • Current self-correction methods rely on oracle feedback or separate teacher models, which are not available in real-world test settings.
  • Supervised fine-tuning (SFT) approaches suffer from distribution shift (mismatch between training data and model's own errors) or behavior collapse (learning to minimize edits rather than fix errors).
  • Achieving reliable self-correction is essential for LLMs to implement meta-strategies for complex reasoning tasks like math and coding.
Concrete Example: Asked to solve a math problem, an SFT-trained model might produce a correct first answer and then change it to an incorrect one in the second turn (behavior collapse), or fail to fix a genuine mistake because the errors in its static training data differ from the errors it actually makes (distribution shift).
Key Novelty
SCoRe (Self-Correction via Reinforcement Learning)
  • Trains on the model's own self-generated distribution of traces (on-policy) to avoid distribution mismatch seen in offline SFT.
  • Uses a two-stage training process: Stage I initializes a policy that decouples the first and second attempts by optimizing second-attempt corrections while constraining the first attempt to stay close to the base model (preventing collapse), and Stage II jointly optimizes both attempts with reward shaping.
  • Reward shaping in Stage II explicitly incentivizes 'progress' (improving from incorrect to correct) rather than just final answer correctness.
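The Stage II progress incentive described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the exact-match correctness check, and the coefficient `alpha` are all assumptions; the key idea is that the second attempt's reward includes a bonus proportional to the change in correctness between attempts.

```python
# Hedged sketch of SCoRe-style Stage II reward shaping (names illustrative,
# not from the paper's code): the second attempt earns its own correctness
# reward plus a bonus proportional to *progress* over the first attempt.

def correctness(answer: str, reference: str) -> float:
    """Binary reward: 1.0 if the answer matches the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def shaped_rewards(first: str, second: str, reference: str, alpha: float = 1.0):
    """Return (r1, r2_shaped) for a two-turn trace.

    The second-attempt reward is shaped with alpha * (r2 - r1): a large
    positive bonus for flipping an incorrect first attempt to correct, and
    a penalty for degrading a correct one -- the 'progress' incentive.
    """
    r1 = correctness(first, reference)
    r2 = correctness(second, reference)
    return r1, r2 + alpha * (r2 - r1)

# Incorrect -> correct: base reward plus the full progress bonus.
print(shaped_rewards("41", "42", "42"))  # (0.0, 2.0) with alpha=1.0
# Correct -> incorrect: shaping penalizes degrading a correct answer.
print(shaped_rewards("42", "41", "42"))  # (1.0, -1.0)
```

Under this shaping, merely repeating a correct first answer earns no bonus, so the policy cannot maximize reward by collapsing into "best first response, no edits".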
Evaluation Highlights
  • +15.6% improvement in intrinsic self-correction (delta between first and second attempt) on MATH using Gemini 1.5 Flash compared to the base model.
  • +9.1% improvement in intrinsic self-correction on HumanEval using Gemini 1.0 Pro compared to the base model.
  • Achieves positive self-correction deltas (+4.4% on MATH), whereas baselines like STaR and Self-Refine often yield negligible or negative improvement.
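The self-correction delta reported in these highlights is simply second-attempt accuracy minus first-attempt accuracy over an evaluation set. A minimal sketch (the record field names are illustrative, not from the paper):

```python
# Intrinsic self-correction delta: accuracy at attempt 2 minus accuracy at
# attempt 1. A positive delta means revision helps more than it hurts.

def self_correction_delta(records):
    """records: list of dicts with boolean 'correct_t1' and 'correct_t2'."""
    n = len(records)
    acc_t1 = sum(r["correct_t1"] for r in records) / n
    acc_t2 = sum(r["correct_t2"] for r in records) / n
    return acc_t2 - acc_t1

traces = [
    {"correct_t1": False, "correct_t2": True},   # fixed a mistake
    {"correct_t1": True,  "correct_t2": True},   # kept a correct answer
    {"correct_t1": True,  "correct_t2": False},  # degraded a correct answer
    {"correct_t1": False, "correct_t2": True},   # fixed another mistake
]
print(self_correction_delta(traces))  # 0.25: t2 accuracy 0.75 - t1 accuracy 0.5
```

A negative delta corresponds to the failure mode described in the Core Problem section, where revision degrades more answers than it repairs.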
Breakthrough Assessment
9/10
Significantly positive intrinsic self-correction results are rare in the literature. SCoRe identifies and solves the critical 'behavior collapse' failure mode of previous methods.