ReFT: Reasoning with Reinforced Fine-Tuning

📝 Paper Summary

Mathematical Reasoning | Fine-tuning Large Language Models | Reinforcement Learning for Reasoning

ReFT fine-tunes language models for math reasoning by using PPO to learn from multiple automatically sampled reasoning paths rather than from a single ground-truth annotation per question.

Core Problem

Supervised Fine-Tuning (SFT) relies on a single annotated Chain-of-Thought (CoT) path per question, limiting the model's ability to explore diverse valid reasoning strategies and generalize to new problems.

Why it matters:

Math problems often have multiple valid reasoning paths, but training data usually provides only one, restricting the model's learning potential.
SFT models often struggle with generalization; exploring alternative paths via reinforcement learning can provide richer supervision signals.
Reliance on fixed CoT annotations can lead to overfitting on specific phrasing rather than learning robust problem-solving logic.

Concrete Example: For a question like 'Weng earns $12/hour...', SFT trains on one specific explanation. If the model generates a valid but different path (e.g., converting minutes to hours differently), SFT doesn't reward it, whereas ReFT samples multiple paths and rewards any that reach the correct numeric answer.

Key Novelty

Reinforced Fine-Tuning (ReFT) for Math

Warm up the model with standard SFT, then switch to reinforcement learning with PPO on the same training questions, without using the ground-truth reasoning paths.
Derive rewards automatically by checking if the final answer matches the ground truth, allowing the model to explore and learn from any valid reasoning path it generates.
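The answer-matching reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (extract_final_number, answer_reward), the regex-based extraction, and the strictly binary 1/0 reward are assumptions made for the example, and the sample reasoning paths are invented.

```python
import re
from typing import Optional


def extract_final_number(cot: str) -> Optional[float]:
    """Return the last number appearing in a generated CoT, or None if there is none.

    Naive heuristic (an assumption of this sketch): the final answer is taken
    to be the last numeric token in the text.
    """
    # Drop thousands separators so "1,200" parses as 1200.
    matches = re.findall(r"-?\d+(?:\.\d+)?", cot.replace(",", ""))
    return float(matches[-1]) if matches else None


def answer_reward(cot: str, gold: float, tol: float = 1e-6) -> float:
    """Terminal reward: 1.0 if the extracted answer matches the gold answer, else 0.0."""
    predicted = extract_final_number(cot)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - gold) <= tol else 0.0


# Two different (made-up) reasoning paths that reach the same answer,
# and one path that does not.
gold = 6.0
path_a = "30 minutes is 0.5 hours, so the pay is 12 * 0.5 = 6"
path_b = "The rate is 12 / 60 = 0.2 per minute, so 30 minutes pays 0.2 * 30 = 6"
path_c = "12 * 30 = 360"
rewards = [answer_reward(p, gold) for p in (path_a, path_b, path_c)]
```

Both path_a and path_b receive the full reward even though neither needs to match an annotated CoT, which is the property that lets the policy learn from paths it discovers on its own.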

Architecture

Conceptual comparison between SFT and ReFT processes. SFT trains on fixed (question, CoT, answer) triples. ReFT warms up with SFT, then uses RL (PPO) to sample multiple CoTs (e'), compare their answers (y') to the gold answer (y) to generate rewards, and update the policy.

Evaluation Highlights

ReFT outperforms SFT by 9.71 accuracy points on GSM8K (43.59 → 53.30) using CodeLLAMA-7B with natural language CoT.
Achieves 81.2% accuracy on GSM8K using CodeLLAMA-7B with Program-CoT + Reranking, surpassing larger models like MAmmoTH-Coder-70B (76.7%).
Consistent improvements across GSM8K, SVAMP, and MathQA datasets using both Galactica and CodeLLAMA foundation models.

Breakthrough Assessment

8/10

Simple yet highly effective method that significantly boosts reasoning performance using existing data, without requiring external reward models or extra datasets.

⚙️ Technical Details

Problem Definition

Setting: Math word problem solving using Chain-of-Thought (CoT) generation

Inputs: Math question x

Outputs: Reasoning path e followed by final answer y

Pipeline Flow

Warm-up Stage (SFT)
Reinforcement Learning Stage (PPO)

System Modules

Policy Model (Warm-up) (Training)

Initialize the model with basic reasoning capabilities using standard SFT on (question, CoT) pairs

Model or implementation: CodeLLAMA-7B or Galactica-6.7B

Policy Model (RL) (Training)

Explore diverse reasoning paths and update weights based on answer correctness

Model or implementation: Same as Warm-up model

Novel Architectural Elements

Integration of PPO directly after SFT warm-up on the *same* dataset without external reward models, relying solely on ground-truth answer verification for rewards

Modeling

Base Model: CodeLLAMA-7B and Galactica-6.7B

Training Method: PPO (Proximal Policy Optimization)

Objective Functions:

Purpose: Maximize expected reward while limiting policy change.
Formally: L_policy, the PPO clipped surrogate objective.

Purpose: Minimize error in value estimation.
Formally: L_value = MSE(V_phi(s), R_t).

Purpose: Unified loss.
Formally: L_RL = L_policy + alpha * L_value.
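The three objectives above can be written out in a few lines of plain Python. This is a sketch under stated assumptions: it uses the standard PPO clipped surrogate with per-token log-probabilities, a plain mean-squared-error value loss as written in this summary, and function names chosen for the example. The weight alpha is left as a required argument because its value is not given here.

```python
import math
from typing import Sequence


def clipped_policy_loss(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Standard PPO clipped surrogate, negated so that lower is better."""
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return -total / len(advantages)


def value_loss(values: Sequence[float], returns: Sequence[float]) -> float:
    """L_value = MSE(V_phi(s), R_t)."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)


def rl_loss(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    advantages: Sequence[float],
    values: Sequence[float],
    returns: Sequence[float],
    alpha: float,
    clip_eps: float = 0.2,
) -> float:
    """L_RL = L_policy + alpha * L_value."""
    policy = clipped_policy_loss(logp_new, logp_old, advantages, clip_eps)
    return policy + alpha * value_loss(values, returns)
```

The clipping is what "limiting policy change" means in practice: once the probability ratio leaves the interval [1 - epsilon, 1 + epsilon] in the direction the advantage favors, the term stops contributing extra gradient.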

Adaptation: Full fine-tuning

Key Hyperparameters:

warmup_epochs: 2 (in most settings)
rl_epochs: 300
learning_rate: 3e-7 (RL stage)
kl_coefficient_beta: 0.01 (P-CoT) / 0.05 (N-CoT)
ppo_clip_epsilon: 0.2
gae_lambda: 0.95
discount_factor_gamma: 1
batch_size: 48
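To show where kl_coefficient_beta, gae_lambda, and discount_factor_gamma enter the computation, here is a small sketch of KL-shaped rewards followed by Generalized Advantage Estimation. The function names are hypothetical, and the per-token KL term is approximated by the log-probability difference of the sampled token, which is a common simplification assumed here rather than something stated in this summary.

```python
from typing import List, Sequence, Tuple


def shaped_rewards(
    terminal_reward: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float,
) -> List[float]:
    """Per-token reward: a KL penalty at every token, plus the answer reward on the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += terminal_reward
    return rewards


def gae(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Generalized Advantage Estimation over one sampled sequence.

    Returns (advantages, returns), where returns[t] = advantages[t] + values[t].
    The value after the final token is treated as 0.
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        next_value = values[t + 1] if t + 1 < horizon else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Toy three-token sequence: the answer reward arrives only at the last token.
token_rewards = shaped_rewards(1.0, [-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5], beta=0.01)
advantages, returns = gae(token_rewards, values=[0.0, 0.0, 0.0])
```

With gamma set to 1, the terminal answer reward is not discounted on its way back to earlier tokens; only lambda attenuates how much credit each earlier token receives.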

Compute: 8 A100-80GB GPUs

Comparison to Prior Work

vs. Offline/Online Self-Training: ReFT uses PPO to learn from both positive (correct answer) and negative (incorrect answer) signals, and uses KL constraints to prevent collapse, whereas self-training typically only learns from positive samples.
vs. Reward Model Reranking: ReFT improves the policy model itself to generate better answers, whereas reranking requires generating many samples and filtering them at inference time (though ReFT can be combined with reranking).
vs. RFT (Rejection sampling Fine-Tuning by Yuan et al.) [not cited in paper]: This paper (ReFT) uses PPO with online RL updates, whereas RFT fine-tunes on correct reasoning paths collected by rejection sampling, in the spirit of Expert Iteration, without online RL updates.

Limitations

Susceptible to reward hacking on multiple-choice questions (e.g., MathQA), where the model guesses the correct option label with invalid reasoning.
Requires ground truth answers to compute rewards, limiting applicability to open-ended tasks without definitive answers.
Training is computationally more expensive and slower to converge than SFT (e.g., 300 epochs vs 40 epochs).

Reproducibility

Code: https://github.com/lqtrung1998/mwp_ReFT

Code is publicly available. Datasets (GSM8K, MathQA, SVAMP) are standard public benchmarks. Hyperparameters are detailed in the paper.

📊 Experiments & Results

Evaluation Setup

Math word problem solving with CoT generation

Benchmarks:

GSM8K (Grade school math word problems)
SVAMP (Math word problems with varying difficulty)
MathQA (Complex math problems, multiple choice)

Metrics:

Accuracy (exact match of final numeric answer)
Statistical methodology: Not explicitly reported in the paper

Key Results

ReFT consistently outperforms SFT across different datasets and model architectures (Galactica, CodeLLAMA).

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy (N-CoT)	43.59	53.30	+9.71
GSM8K	Accuracy (P-CoT)	63.68	75.28	+11.60
SVAMP	Accuracy (P-CoT)	75.40	79.19	+3.79
MathQA MCQ	Accuracy (N-CoT)	56.01	60.13	+4.12

Inference-time strategies like Majority Voting and Reranking further boost ReFT's performance.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy (P-CoT)	77.0	81.2	+4.2
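The two inference-time strategies named above are simple to express. The sketch below assumes that several candidate solutions have already been sampled from the policy; majority_vote and rerank are illustrative names, and score_fn stands in for a separately trained reward model, which is not implemented here.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(answers: Sequence[Optional[float]]) -> Optional[float]:
    """Return the most frequent extracted answer, ignoring failed extractions (None)."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def rerank(candidates: Sequence[T], score_fn: Callable[[T], float]) -> T:
    """Return the candidate with the highest score under score_fn."""
    return max(candidates, key=score_fn)


# Five sampled answers for one question; one sample failed to produce a number.
voted = majority_vote([10.0, 12.0, 10.0, None, 10.0])
```

Both strategies only change how a final answer is chosen from samples, so they can be applied on top of either an SFT or a ReFT policy.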

Experiment Figures

Training dynamics of ReFT on GSM8K P-CoT: Mean Training Reward, Evaluation Accuracy, and Sequence KL divergence over epochs.

Impact of different warm-up lengths on ReFT performance.

Main Takeaways

ReFT significantly improves generalization by exploring multiple reasoning paths during training, unlike SFT which is limited to single annotations.
The method works effectively for both Natural Language CoT (N-CoT) and Program CoT (P-CoT).
ReFT is compatible with and enhanced by inference-time techniques like majority voting and reranking.
Reward hacking can occur in multiple-choice settings (MathQA), where the model learns to output the correct choice label despite incorrect reasoning.
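For the P-CoT setting mentioned in the takeaways, the final answer comes from executing the generated program rather than from parsing text. The sketch below assumes that the program defines a zero-argument function named solution; that entry-point name and the helper run_program_cot are assumptions for illustration. Running model-generated code with exec is unsafe outside a sandbox, so this shows only the control flow.

```python
from typing import Any, Optional


def run_program_cot(program: str, entry_point: str = "solution") -> Optional[Any]:
    """Execute a generated program and return the result of its entry-point function.

    Returns None if the program fails to compile, raises at runtime, or does
    not define the entry point. No sandboxing is done here.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)
        return namespace[entry_point]()
    except Exception:
        return None


# A made-up program-style reasoning path.
sample_program = (
    "def solution():\n"
    "    hourly_rate = 12\n"
    "    minutes_worked = 30\n"
    "    return hourly_rate * minutes_worked / 60\n"
)
predicted = run_program_cot(sample_program)
```

A program that crashes or never defines the entry point yields None, which an answer-matching reward would treat as incorrect.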

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Supervised Fine-Tuning (SFT)
Reinforcement Learning (specifically PPO)

Key Terms

CoT: Chain-of-Thought—intermediate reasoning steps generated by the model before producing the final answer

PPO: Proximal Policy Optimization—a reinforcement learning algorithm that updates the model policy while preventing drastic deviations from the previous policy

SFT: Supervised Fine-Tuning—training the model on a fixed dataset of (question, reasoning path, answer) tuples using standard language modeling loss

N-CoT: Natural Language Chain-of-Thought—reasoning steps expressed in plain text

P-CoT: Program-based Chain-of-Thought—reasoning steps expressed as executable Python code

majority voting: An inference strategy where the model generates multiple solutions and selects the most frequent answer

reward model reranking: An inference strategy where a separate model scores multiple generated solutions, and the highest-scoring one is selected

KL divergence: A measure of how much one probability distribution differs from another; used here as a penalty term in RL to ensure the updated model does not drift too far from the reference (initial) model