Process-based self-rewarding language models

📝 Paper Summary

Mathematical Reasoning Self-Improvement / Self-Correction

The paper proposes a self-rewarding pipeline where language models generate their own training data by performing step-by-step reasoning and acting as a judge for individual steps, enabling iterative improvement in mathematical reasoning.

Core Problem

Existing self-rewarding methods work well for instruction following but fail or degrade performance in mathematical reasoning because outcome-based rewards are too coarse for complex, multi-step problems.

Why it matters:

Reliance on human-annotated preference data is expensive and constrained by human performance limits.
Current self-rewarding paradigms often lead to performance decline in math tasks as iterations increase.
Assigning a single score to a complex, long chain-of-thought solution is difficult and less consistent than step-wise verification.

Concrete Example: In standard self-rewarding, a model might generate a long math solution that is mostly correct but fails at step 5. If the final answer is wrong, every step in the chain is penalized equally, including the correct ones; if the answer is coincidentally correct (a false positive), the flawed logic is reinforced. This lack of granularity prevents effective learning.

Key Novelty

Process-based Self-Rewarding (Step-wise Judge + Step-wise DPO)

Integrates 'LLM-as-a-Judge' at the individual reasoning step level, rather than judging the final answer.
Uses Monte Carlo Tree Search (MCTS) guided by the model's own step-wise judgments to generate preference pairs (best vs. worst steps).
Applies step-wise Direct Preference Optimization (DPO) to iteratively refine the model's reasoning process.

Architecture

The iterative training pipeline of Process-based Self-Rewarding Language Models.

Evaluation Highlights

Process-based Self-Rewarding significantly outperforms the base model on mathematical benchmarks (e.g., +6.8 accuracy points on GSM8K and +3.7 on MATH for the 7B model).
Demonstrates iterative improvement: performance consistently increases across multiple self-rewarding iterations (M0 -> M1 -> M2 -> M3).
The method improves both mathematical reasoning capability and the model's ability to act as a judge (meta-rewarding capability).

Breakthrough Assessment

8/10

Directly addresses the failure of previous self-rewarding methods in math domains by shifting to process supervision. The iterative gains without external supervision suggest the approach could scale beyond the limits of human-annotated data.

⚙️ Technical Details

Problem Definition

Setting: Iterative self-training of Large Language Models for mathematical reasoning.

Inputs: Mathematical problems (prompts x).

Outputs: Step-by-step reasoning chains (s_1, ..., s_l) ending in a final answer.

Pipeline Flow

Initialization: Train base model on seed IFT (Reasoning) and EFT (Judging) data.
Generation: Model generates reasoning steps using MCTS and self-evaluates candidate steps.
Selection: Construct preference pairs (best step vs. worst step) from the search tree.
Optimization: Train model using Step-wise DPO on generated pairs.
Iteration: Repeat Generation, Selection, and Optimization with the updated model.
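
The loop above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `run_self_rewarding`, `generate_pairs`, and `train_step_dpo` are hypothetical names, and the search and training routines are passed in as callables because the summary does not specify them.

```python
from typing import Callable, List, Tuple

# A preference pair: (prefix of earlier steps, chosen step, rejected step).
PreferencePair = Tuple[str, str, str]


def run_self_rewarding(
    initial_model: object,
    questions_per_round: List[List[str]],
    generate_pairs: Callable[[object, List[str]], List[PreferencePair]],
    train_step_dpo: Callable[[object, List[PreferencePair]], object],
) -> List[object]:
    """Return [M_init, M_next, ...], one new model per round of questions."""
    models = [initial_model]
    for questions in questions_per_round:
        current = models[-1]
        # Generation + Selection: the current model searches and judges its own steps.
        pairs = generate_pairs(current, questions)
        # Optimization: step-wise DPO on the self-generated pairs.
        models.append(train_step_dpo(current, pairs))
    return models
```

Each round trains on pairs produced by the most recent model, so improvements in judging feed directly into the next round's training data.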

System Modules

Reasoning Generator

Generates candidate next-steps for a math problem.

Model or implementation: Qwen2.5-Math (7B or 72B)

Step-wise Judge

Evaluates candidate steps pairwise to determine which is better.

Model or implementation: Same as Generator (Self-Rewarding)

Preference Selector

Selects the best and worst steps from candidates to form training pairs.

Model or implementation: Search Algorithm (MCTS logic)
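
One simple way to turn pairwise step judgments into a (chosen, rejected) pair is to count wins in a round-robin comparison. The sketch below assumes that rule for illustration only; the paper's actual scoring inside MCTS may differ, and `select_best_and_worst` and `judge_prefers_first` are hypothetical names.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def select_best_and_worst(
    candidates: List[str],
    judge_prefers_first: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Pick (chosen, rejected) candidate steps by counting pairwise wins."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidate steps")
    wins = {i: 0 for i in range(len(candidates))}
    # Compare every unordered pair once and credit the preferred step.
    for i, j in combinations(range(len(candidates)), 2):
        if judge_prefers_first(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Ties are broken by candidate order. If every candidate ties, chosen and
    # rejected coincide and the caller should discard the pair.
    best = max(wins, key=wins.get)
    worst = min(wins, key=wins.get)
    return candidates[best], candidates[worst]
```

In the self-rewarding setting, `judge_prefers_first` would wrap a call to the same model that generated the candidates.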

Novel Architectural Elements

Unified Reasoning and Process-Judging: The same model weights perform both generation and step-level judging within a loop.
Step-wise Self-Rewarding Loop: Unlike standard self-rewarding (response-level), this architecture embeds the judge inside the generation search (MCTS) to create step-level signals.

Modeling

Base Model: Qwen2.5-Math-7B and Qwen2.5-Math-72B

Training Method: Iterative Step-wise Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize the policy to prefer chosen steps over rejected steps while staying close to the reference model.

Formally: DPO loss function applied at the step level.
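
For reference, the standard DPO objective conditioned on a partial solution can be written as below. This is a sketch assuming the usual DPO form; the paper's exact formulation may differ. The symbols $s_k^{w}$ (chosen step), $s_k^{l}$ (rejected step), $s_{1:k-1}$ (preceding steps), $\pi_{\text{ref}}$ (reference model), $\beta$, and the pair dataset $\mathcal{D}$ are notation introduced here.

```latex
\mathcal{L}_{\text{step-DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, s_{1:k-1},\, s_k^{w},\, s_k^{l}) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(s_k^{w} \mid x, s_{1:k-1})}{\pi_{\text{ref}}(s_k^{w} \mid x, s_{1:k-1})}
      - \beta \log \frac{\pi_\theta(s_k^{l} \mid x, s_{1:k-1})}{\pi_{\text{ref}}(s_k^{l} \mid x, s_{1:k-1})}
    \right) \right]
```

The only change from response-level DPO is the conditioning: both steps share the same problem $x$ and the same prefix $s_{1:k-1}$, so the loss isolates the quality of a single step.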

Adaptation: Full fine-tuning

Trainable Parameters: Full parameters

Training Data:

Seed IFT: 28,889 NuminaMath samples processed into step-by-step format by GPT-o1.
Seed EFT: 4,167 pairwise judgment samples filtered from PRM800k using a trained Qwen2.5-72B PRM and GPT-o1 annotation.
Iterative PPD: Generated by the model itself using 400, 800, and 1200 math questions for subsequent iterations.

Key Hyperparameters:

learning_rate: 5e-7 (DPO), 1e-6 (SFT initialization)
batch_size: 32
beta: Not explicitly reported in the paper (DPO parameter)
search_width: 6
max_iteration_number: 20
temperature: 0.5
top_p: 0.95
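
The reported values can be collected into a small configuration object. The grouping into training and search settings and the class and field names are assumptions made for illustration; `dpo_beta` is left unset because the paper does not report it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingConfig:
    dpo_learning_rate: float = 5e-7
    sft_learning_rate: float = 1e-6   # used for the SFT initialization
    batch_size: int = 32
    dpo_beta: Optional[float] = None  # not reported in the paper; must be chosen by the user


@dataclass(frozen=True)
class SearchConfig:
    search_width: int = 6
    max_iteration_number: int = 20
    temperature: float = 0.5
    top_p: float = 0.95
```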

Compute: 32 NVIDIA H100 GPUs for training/fine-tuning. Initial PRM training used 128 NVIDIA A100 GPUs.

Comparison to Prior Work

vs. Yuan et al. (2024): Applies rewards at the *step* level rather than the *response* level; uses pairwise comparison for steps rather than absolute scoring.
vs. PRM approaches: Does not require a separate frozen reward model; the policy model evolves to become its own better judge.
vs. Meta-Rewarding: Focuses specifically on mathematical reasoning chains rather than general instruction following.

Limitations

Relies on GPT-o1 for high-quality seed data initialization (cold start).
Computational cost of inference-time search (MCTS) and self-evaluation is high compared to direct generation.
Experiments limited to math domain; generalization to other reasoning tasks (coding, logic) not tested.

Reproducibility

Code: https://github.com/Shimao-Zhang/Process-Self-Rewarding

Code and data are available at the repository above. Initial data synthesis uses OpenAI's GPT-o1, a closed-source dependency.

📊 Experiments & Results

Evaluation Setup

Evaluated on mathematical reasoning capability and LLM-as-a-Judge capability across iterations.

Benchmarks:

GSM8K (Grade School Math)
MATH (Challenging Math Problems)
Gaokao2023En (College Entrance Exam Math)
OlympiadBench (Math Olympiad Problems)
AIME2024 (Math Competition)
AMC2023 (Math Competition)

Metrics:

Accuracy (Math)
Accuracy (Judge consistency with human/oracle)
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance gains of the Process-based Self-Rewarding method (M3 iteration) compared to the base Qwen2.5-Math-7B-Instruct model.
Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	82.9	89.7	+6.8
MATH	Accuracy	73.2	76.9	+3.7
Gaokao2023En	Accuracy	63.0	70.1	+7.1
OlympiadBench	Accuracy	39.5	44.1	+4.6
AIME2024	Accuracy	13.3	16.7	+3.4
AMC2023	Accuracy	50.0	52.5	+2.5

Ablation study demonstrating the effectiveness of the proposed components compared to standard self-rewarding.
Benchmark	Metric	Baseline	This Paper	Δ
MATH	Accuracy	70.0	76.9	+6.9
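
The Δ values for the six main benchmarks can be recomputed from the baseline and M3 accuracies listed above. The snippet is only an arithmetic check; the variable and function names are illustrative.

```python
# (baseline accuracy, M3 accuracy) for the six main benchmarks, as listed above.
MAIN_RESULTS = {
    "GSM8K": (82.9, 89.7),
    "MATH": (73.2, 76.9),
    "Gaokao2023En": (63.0, 70.1),
    "OlympiadBench": (39.5, 44.1),
    "AIME2024": (13.3, 16.7),
    "AMC2023": (50.0, 52.5),
}


def accuracy_deltas(results):
    """Return the gain in accuracy points per benchmark, rounded to one decimal."""
    return {name: round(ours - base, 1) for name, (base, ours) in results.items()}


deltas = accuracy_deltas(MAIN_RESULTS)
mean_gain = sum(deltas.values()) / len(deltas)
```

The gains are absolute differences in accuracy points, not relative percentages.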

Main Takeaways

Iterative improvement: Accuracy consistently increases from M0 to M3, validating the self-rewarding loop.
Process vs. Outcome: Standard outcome-based self-rewarding fails in math (performance degrades or stagnates), while process-based self-rewarding succeeds.
Joint Capability: The model improves its ability to *judge* reasoning steps alongside its ability to *generate* them.
Scaling: Effective at both the 7B and 72B parameter scales; the 72B model also improves, though its relative gain is smaller than the 7B model's because of its higher baseline.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Chain-of-Thought (CoT) Reasoning
Direct Preference Optimization (DPO)

Key Terms

Self-Rewarding Language Models: A paradigm where a single model acts as both the instruction-following policy and the reward model to generate its own training data.

LLM-as-a-Judge: Using a Large Language Model to evaluate the quality of text outputs, often by acting as a pairwise comparator.

Process Reward Model (PRM): A reward model that evaluates reasoning steps individually rather than just the final outcome.

MCTS: Monte Carlo Tree Search—a search algorithm used to explore reasoning paths by simulating future outcomes.

DPO: Direct Preference Optimization—a method to fine-tune models on preference pairs without training an explicit reward model first.

IFT: Instruction Fine-Tuning—supervised training on demonstration data.

EFT: Evaluation Fine-Tuning—supervised training to teach the model how to judge/evaluate responses.

PPD: Pair-wise Preference Data—data consisting of 'chosen' and 'rejected' sample pairs used for training.