Improve Mathematical Reasoning in Language Models by Automated Process Supervision

📝 Paper Summary

Mathematical Reasoning, Reward Modeling, Process Supervision

OmegaPRM automates process supervision for math reasoning by using a divide-and-conquer Monte Carlo Tree Search to efficiently locate errors and generate over 1.5 million high-quality training labels without human intervention.

Core Problem

Training Process Reward Models (PRMs) requires granular labels for every reasoning step, but collecting these labels relies on expensive human annotation or inefficient brute-force Monte Carlo estimation.

Why it matters:

Outcome Reward Models (ORMs) provide sparse feedback, failing to identify where exactly a multi-step reasoning chain goes wrong
Current automated methods like brute-force rollouts are computationally expensive (linear complexity with respect to solution length)
Scalable, high-quality process data is the primary bottleneck for improving LLM reasoning capabilities beyond simple prompting

Concrete Example: In a long math proof, if an LLM makes a mistake at step 4 and the final answer is wrong, an ORM only knows that the result is wrong. Finding the exact error requires checking every step: brute-force methods run rollouts from each step in turn (step 1, 2, 3, ... to the end), which is very costly for long chains.

Key Novelty

OmegaPRM: Divide-and-Conquer MCTS for Data Collection

Uses binary search to locate the first error in a solution chain with logarithmic complexity instead of linear scanning
Maintains a state-action tree where nodes store Monte Carlo correctness estimates, allowing efficient reuse of rollouts across different branches
Selects 'convincing wrong-answer' rollouts (high MC estimate but incorrect final answer) for annotation to create harder, more valuable training examples
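The binary-search idea in the first bullet can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 1: the function name `locate_first_error` and the `mc_estimate` callable are hypothetical, and the sketch assumes that once a prefix has a Monte Carlo estimate of 0, every longer prefix does too.

```python
from typing import Callable, Optional, Sequence


def locate_first_error(
    steps: Sequence[str],
    mc_estimate: Callable[[Sequence[str]], float],
) -> Optional[int]:
    """Return the 0-based index of the first erroneous step, or None.

    A prefix counts as erroneous when its Monte Carlo estimate is 0,
    i.e. no rollout from that prefix reaches the correct answer.
    Assumes monotonicity: an erroneous prefix stays erroneous when extended.
    """
    total = len(steps)
    lo, hi = 1, total + 1  # candidate prefix lengths; total + 1 means "no error"
    while lo < hi:
        mid = (lo + hi) // 2
        if mc_estimate(steps[:mid]) == 0:
            hi = mid       # first error is at or before step `mid`
        else:
            lo = mid + 1   # prefix of length `mid` is still recoverable
    return None if lo > total else lo - 1
```

With an 8-step solution whose fourth step is wrong, this sketch evaluates only a handful of prefixes instead of all eight, which is where the logarithmic cost comes from.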

Architecture

Overview of the OmegaPRM algorithm showing the MCTS process. It illustrates how a question is expanded into a tree of partial solutions, how rollouts are performed, and how binary search is used to locate errors.

Evaluation Highlights

Gemini Pro success rate improved from 51% to 69.4% on MATH500 and 86.4% to 93.6% on GSM8K using the collected data
Gemma2 27B success rate boosted from 42.3% to 58.2% on MATH500 and 74.0% to 92.2% on GSM8K
Collected over 1.5 million automated process supervision annotations without any human intervention

Breakthrough Assessment

8/10

Significant efficiency gain in data collection (O(log N) vs O(N)) allowing massive scaling of process supervision data. The resulting performance gains on major benchmarks are substantial.

⚙️ Technical Details

Problem Definition

Setting: Mathematical problem solving with step-by-step reasoning

Inputs: Math question q

Outputs: Step-by-step solution x = (x_1, x_2, ..., x_T)

Pipeline Flow

Generator Policy (Proposes solution steps)
OmegaPRM Tree Search (Builds tree of reasoning paths to collect data)
Binary Search Auditor (Efficiently locates first error in paths)
PRM Trainer (Trains reward model on collected data)
Inference: Weighted Self-Consistency (Uses PRM to score/rank solutions)
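The final inference stage can be illustrated with a small sketch of weighted self-consistency. The summary does not say how per-step PRM scores are combined into a solution score, so the product used here is an assumption, and `weighted_self_consistency` is a hypothetical name.

```python
from collections import defaultdict
from math import prod
from typing import Dict, List, Sequence, Tuple


def weighted_self_consistency(
    candidates: List[Tuple[str, Sequence[float]]],
) -> str:
    """Pick the final answer with the largest total PRM weight.

    Each candidate is (final_answer, per_step_prm_scores).
    Assumption: a solution's weight is the product of its step scores.
    """
    totals: Dict[str, float] = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += prod(step_scores)
    # Raises ValueError on an empty candidate list, which is intentional here.
    return max(totals, key=totals.get)
```

Unlike plain majority voting, a single high-scoring solution can outweigh several low-scoring solutions that agree with each other.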

System Modules

Generator Policy

Generate candidate steps and complete rollouts

Model or implementation: Instruction-tuned Gemini Pro / Gemma2 27B
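The rollouts this module produces feed the Monte Carlo correctness estimate used throughout the pipeline. A minimal sketch, assuming hypothetical `complete` and `is_correct` callables that stand in for the generator policy and the answer checker:

```python
from typing import Callable, Sequence


def mc_estimate(
    prefix: Sequence[str],
    complete: Callable[[Sequence[str]], str],
    is_correct: Callable[[str], bool],
    k: int = 8,
) -> float:
    """Estimate the chance that a partial solution leads to a correct answer.

    Runs k rollouts from `prefix` and returns the fraction whose final
    answer is correct. `complete` maps a prefix to a final answer string.
    The default k = 8 is an arbitrary illustrative value.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    correct = sum(1 for _ in range(k) if is_correct(complete(prefix)))
    return correct / k
```

An estimate of 0 means no sampled rollout recovered the correct answer, which is the signal treated as an erroneous prefix.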

OmegaPRM Tree Search (Data Annotation)

Construct a state-action tree to identify valid and invalid reasoning paths efficiently

Model or implementation: MCTS algorithm (Algorithm 1)

Binary Search Auditor (Data Annotation)

Locate the first error in a selected rollout using binary splitting

Model or implementation: Algorithm logic

Process Reward Model (PRM)

Predict correctness probability of individual reasoning steps

Model or implementation: PaLM 2-S / Gemma2 27B / Gemini Pro (fine-tuned)

Novel Architectural Elements

Divide-and-conquer MCTS node expansion: Nodes are created by binary search splitting rather than single-step expansion
Heuristic selection function prioritizing 'convincing wrong-answer' rollouts (high MC value but incorrect final answer) to mine hard negatives

Modeling

Base Model: Instruction-tuned Gemini Pro, Gemma2 27B, and PaLM 2-S

Training Method: Supervised Fine-Tuning (classification)

Objective Functions:

Purpose: Train PRM to predict step correctness using binary labels.

Formally: standard classification loss minimizing cross-entropy between predicted score and binary label (1 if MC > 0, else 0).

Purpose: Train PRM using soft labels (optional).

Formally: minimize distance between predicted score and Monte Carlo estimate value.
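The two objectives can be compared in a short sketch. The hard-label rule (1 if MC > 0, else 0) comes from the text above. Using cross-entropy as the "distance" for the soft-label case is an assumption, since the summary does not name the distance. Function names are hypothetical.

```python
import math


def hard_label(mc: float) -> float:
    """Binary target: 1 if any rollout from the step succeeded, else 0."""
    return 1.0 if mc > 0 else 0.0


def cross_entropy(pred: float, target: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy between a predicted score and a target in [0, 1].

    Works for hard targets (0 or 1) and for soft targets (the raw MC value).
    """
    p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def hard_loss(pred: float, mc: float) -> float:
    return cross_entropy(pred, hard_label(mc))


def soft_loss(pred: float, mc: float) -> float:
    # Assumption: cross-entropy against the MC estimate itself.
    return cross_entropy(pred, mc)
```

For a step with MC = 0.25, the hard-label loss pushes the prediction toward 1, while the soft-label loss is smallest when the prediction equals 0.25.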

Training Data:

1.5 million process supervision annotations collected via OmegaPRM
Questions from MATH dataset training set

Key Hyperparameters:

MCTS_alpha: 0.5
MCTS_beta: 0.9
MCTS_L: 500
MCTS_c_puct: 0.125
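To make the roles of these hyperparameters concrete, here is a sketch of a rollout selection score. The summary lists the values but not the formulas, so the functional forms below (a value term that rewards high MC and short rollouts, plus a PUCT-style exploration term) are assumptions, and the function names are hypothetical.

```python
import math

# Values taken from the hyperparameter list above.
ALPHA = 0.5
BETA = 0.9
L = 500
C_PUCT = 0.125


def q_value(mc: float, rollout_len: int) -> float:
    """Assumed form: alpha^(1 - MC) * beta^(len / L).

    Largest when MC is near 1 and the rollout is short. Applied to rollouts
    with a wrong final answer, this favors 'convincing' wrong answers.
    """
    return ALPHA ** (1.0 - mc) * BETA ** (rollout_len / L)


def u_value(visits: int, total_visits: int) -> float:
    """Assumed PUCT-style exploration bonus that shrinks with visit count."""
    return C_PUCT * math.sqrt(total_visits) / (1 + visits)


def selection_score(mc: float, rollout_len: int, visits: int, total_visits: int) -> float:
    return q_value(mc, rollout_len) + u_value(visits, total_visits)
```

Under these assumed forms, alpha controls how strongly a low MC estimate is penalized, beta and L control the length penalty, and c_puct scales exploration of rarely visited states.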

Compute: Not reported in the paper

Comparison to Prior Work

vs. Math-Shepherd: Uses O(log N) binary search for error location instead of O(N) linear scan
vs. MiPS: Introduces tree structure to reuse rollouts across branches rather than independent rollouts
vs. Uesato et al. (2022) / Lightman et al. (2023): Fully automated data collection without human annotators

Limitations

Relies on a sufficiently capable 'completer' policy; if the model cannot solve the problem at all, no signal is generated
Binary search assumes monotonicity of correctness (once a step is wrong, subsequent steps cannot 'fix' it to be logically correct)
Computational cost of MCTS is still significant compared to simple self-consistency, though more efficient than brute-force

Reproducibility

No code is provided. The method relies on Gemini Pro and PaLM 2 models, which are proprietary. The dataset of 1.5 million annotations is described, but no download link is provided.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks solving word problems and producing step-by-step solutions

Benchmarks:

MATH500 (challenging math problems; a subset of the MATH test set)
GSM8K (Grade school math word problems)

Metrics:

Success Rate (Accuracy)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Model	Metric	Baseline	This Paper	Δ
MATH500	Gemini Pro	Success Rate	51.0	69.4	+18.4
GSM8K	Gemini Pro	Success Rate	86.4	93.6	+7.2
MATH500	Gemma2 27B	Success Rate	42.3	58.2	+15.9
GSM8K	Gemma2 27B	Success Rate	74.0	92.2	+18.2

Ablation study comparing different labeling strategies for training the PRM:

Benchmark	Metric	Baseline	This Paper	Δ
MATH500	Success Rate	66.2	69.4	+3.2

Experiment Figures

Comparison of Brute-force Monte Carlo estimation vs. Binary Search estimation

Main Takeaways

Process supervision significantly outperforms outcome supervision (ORM) and standard self-consistency baselines
Automated data collection via OmegaPRM scales effectively, gathering 1.5M+ labels without humans
Pointwise soft labels (using raw MC probability) perform better than hard binary labels or pairwise preference learning for PRM training
The method generalizes across model sizes, showing gains for both Gemini Pro and the smaller Gemma2 27B

📚 Prerequisite Knowledge

Prerequisites

Monte Carlo Tree Search (MCTS)
Process Reward Models (PRM) vs Outcome Reward Models (ORM)
Chain-of-Thought (CoT) prompting

Key Terms

PRM: Process Reward Model—a model trained to score the correctness of each intermediate step in a solution

ORM: Outcome Reward Model—a model trained to score only the final answer of a solution

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps

MCTS: Monte Carlo Tree Search—a search algorithm that builds a decision tree by randomly sampling future outcomes (rollouts) to estimate the value of current states

Rollout: Completing a partial solution by letting the model generate steps until it reaches a final answer

PUCT: Predictor + Upper Confidence Bound applied to Trees—a selection formula used in MCTS to balance exploitation (high value) and exploration (low visit count)