Let's Verify Step by Step

📝 Paper Summary

Mathematical reasoning, Reward modeling, Process supervision

Process supervision (rewarding individual reasoning steps) significantly outperforms outcome supervision (rewarding final answers) for training reliable reward models in complex mathematical reasoning.

Core Problem

Large language models often produce logical mistakes and hallucinations in multi-step reasoning tasks, and outcome-based feedback is insufficient for precise credit assignment.

Why it matters:

Single logical errors can derail entire solutions in complex domains like math
Outcome supervision (checking only the final answer) provides sparse feedback and struggles with credit assignment
Models trained on outcomes may learn to reach correct answers via incorrect reasoning (misalignment)
Human feedback is expensive, making efficient data collection strategies critical

Concrete Example: A model might output a solution that makes a calculation error in step 3 but accidentally arrives at the correct final answer. An outcome-supervised model would label this 'correct', reinforcing bad logic. A process-supervised model would flag step 3 as incorrect.

Key Novelty

Large-Scale Process Supervision with Active Learning

Train a Process-supervised Reward Model (PRM) using a massive dataset (PRM800K) of 800,000 human-labeled intermediate steps
Use active learning to select 'convincing wrong-answer' solutions (high PRM score but wrong final answer) for human labeling, maximizing data efficiency
Define the score of a whole solution as the probability that *every* step is correct (product of step probabilities)
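The solution-level score described above can be sketched as follows. This is a minimal illustration, assuming the PRM has already produced one correctness probability per step; the function name `solution_score` is hypothetical and not taken from the paper's code.

```python
import math
from typing import Sequence


def solution_score(step_probs: Sequence[float]) -> float:
    """Score a solution as the probability that every step is correct.

    Assumes each entry is the PRM's probability that one step is correct,
    and combines them by taking their product.
    """
    if not step_probs:
        raise ValueError("a solution needs at least one step")
    if any(p < 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("step probabilities must lie in [0, 1]")
    return math.prod(step_probs)


# Three fairly confident steps still compound into a lower overall score.
print(round(solution_score([0.9, 0.9, 0.9]), 3))  # 0.729
```

Because the score is a product, a single low-probability step pulls the whole solution down, which is what lets the PRM penalize a solution that reaches the right answer through a faulty step.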

Architecture

Screenshot of the data collection interface showing a step-by-step solution with human-assigned labels (positive/neutral/negative).

Evaluation Highlights

Process-supervised Reward Model (PRM) solves 78.2% of problems on a representative subset of the MATH test set (best-of-1860 search)
Outcome-supervised Reward Model (ORM) solves 72.4% on the same subset, despite being trained on more (but outcome-only) data
Active learning improves data efficiency by approximately 2.6x compared to uniform data labeling

Breakthrough Assessment

9/10

Establishment of process supervision as clearly superior to outcome supervision for reasoning, backed by a massive released dataset (PRM800K) and state-of-the-art results on MATH.

⚙️ Technical Details

Problem Definition

Setting: Multi-step mathematical reasoning where a generator produces a chain-of-thought solution

Inputs: A math problem from the MATH dataset

Outputs: A step-by-step solution ending in a final answer

Pipeline Flow

Generator (GPT-4 based) produces N step-by-step solutions
Reward Model (PRM or ORM) scores the solutions
Selector chooses the highest-scoring solution as the final answer
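The three pipeline stages above amount to a best-of-N selection. The sketch below assumes the candidate solutions already exist as strings and that a scoring function is supplied by the caller; `best_of_n` and the toy scores are hypothetical stand-ins for the generator and reward model.

```python
from typing import Callable, Dict, List


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate solution with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=score)


# Toy stand-in for a reward model: a lookup table of made-up scores.
toy_scores: Dict[str, float] = {
    "solution A": 0.31,
    "solution B": 0.87,
    "solution C": 0.54,
}
chosen = best_of_n(list(toy_scores), toy_scores.__getitem__)
print(chosen)  # solution B
```

The same selector works for either reward model: only the `score` callable changes between PRM and ORM.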

System Modules

Generator

Generate candidate solutions in a newline-delimited step-by-step format

Model or implementation: GPT-4 (fine-tuned on MathMix and MATH training data)
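A newline-delimited format makes it straightforward to recover individual steps for step-level labeling or scoring. The helper below is a hypothetical sketch of that parsing, assuming one step per non-empty line; the sample solution text is invented for illustration.

```python
from typing import List


def split_steps(solution: str) -> List[str]:
    """Split a newline-delimited solution into its non-empty steps."""
    return [line.strip() for line in solution.split("\n") if line.strip()]


example = "Let x be the unknown.\n2x + 3 = 11, so 2x = 8.\nTherefore x = 4.\n"
for index, step in enumerate(split_steps(example), start=1):
    print(index, step)
```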

Process-supervised Reward Model (PRM) (Evaluation)

Assign a probability of correctness to each individual step in a solution

Model or implementation: GPT-4 (fine-tuned on PRM800K)

Outcome-supervised Reward Model (ORM) (Evaluation)

Predict whether a full solution is correct based only on the final answer

Model or implementation: GPT-4 (fine-tuned on generator samples with binary outcome labels)

Modeling

Base Model: GPT-4 (pretrained on next-token prediction only, not trained with RLHF)

Training Method: Supervised Fine-Tuning (SFT) for Generator; Classification Fine-Tuning for Reward Models

Objective Functions:

Purpose: Train PRM to predict step correctness.

Formally: Maximize log-likelihood of target tokens (positive/negative/neutral) after the last token of each step.
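Maximizing the log-likelihood of the target label is equivalent to minimizing its negative log-likelihood. The sketch below illustrates that loss on already-normalized label probabilities; it is an assumption-laden toy (hypothetical names `step_nll` and `mean_step_loss`, made-up probabilities), not the paper's training code.

```python
import math
from typing import Dict, List

LABELS = ("positive", "negative", "neutral")


def step_nll(predicted: Dict[str, float], target: str) -> float:
    """Negative log-likelihood of the target label for one step."""
    if target not in LABELS:
        raise ValueError(f"unknown label: {target}")
    return -math.log(predicted[target])


def mean_step_loss(predictions: List[Dict[str, float]], targets: List[str]) -> float:
    """Average the per-step losses over all labeled steps in a solution."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("need one target label per predicted step")
    return sum(step_nll(p, t) for p, t in zip(predictions, targets)) / len(targets)


# Made-up predicted label distributions for a two-step solution.
preds = [
    {"positive": 0.8, "negative": 0.1, "neutral": 0.1},
    {"positive": 0.2, "negative": 0.7, "neutral": 0.1},
]
print(round(mean_step_loss(preds, ["positive", "negative"]), 4))
```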

Trainable Parameters: Full fine-tuning (implied by 'finetune all models from GPT-4')

Training Data:

PRM800K: 800,000 step-level labels across 75,000 solutions
MathMix: 1.5B math-relevant tokens for domain adaptation
ORM training set: 100 uniform samples per problem (no overlap with PRM800K)

Key Hyperparameters:

MathMix_tokens: 1.5B in the dataset; large-scale models train on roughly 3B tokens (2 epochs)
PRM_epochs: 2
Generator_finetuning_epochs: 1
ORM_samples_per_problem: 100

Compute: Not reported in the paper

Comparison to Prior Work

vs. Uesato et al.: Evaluates on harder MATH dataset; uses much larger/more capable base model (GPT-4); collects significantly more feedback data (800k labels)
vs. Majority Voting: PRM explicitly verifies reasoning steps rather than relying on consensus of sampled answers
vs. Minerva: Uses active learning and process supervision rather than just domain-adaptive pretraining and majority voting

Limitations

Comparison between large-scale ORM and PRM is not apples-to-apples due to different training sets (active learning vs uniform)
Test set contamination is possible (MathMix may overlap with MATH test set)
Evaluation uses a subset of the MATH test set (500 problems) rather than the full set
Neutral labels in process supervision introduce ambiguity handled by heuristics

Reproducibility

Code: https://github.com/openai/prm800k

PRM800K dataset is publicly available. Base model is GPT-4 (proprietary). MathMix dataset details provided in Appendix A but data not released.

📊 Experiments & Results

Evaluation Setup

Mathematical problem solving on the MATH dataset

Benchmarks:

MATH: multi-step mathematical reasoning (500 subsampled test problems)
STEM OOD: out-of-distribution STEM questions (AP Physics, Calculus, Chemistry, AMC10/12) [New]

Metrics:

Percentage of problems solved (Best-of-N)
Statistical methodology: Not explicitly reported in the paper
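The headline metric can be computed as below. This is a simplified sketch that assumes exact string matching between the selected final answer and the reference answer; real grading of math answers typically needs normalization, which is omitted here, and `percent_solved` is a hypothetical name.

```python
from typing import List


def percent_solved(selected_answers: List[str], reference_answers: List[str]) -> float:
    """Percentage of problems whose selected final answer matches the reference."""
    if len(selected_answers) != len(reference_answers) or not reference_answers:
        raise ValueError("need one selected answer per reference answer")
    solved = sum(a == b for a, b in zip(selected_answers, reference_answers))
    return 100.0 * solved / len(reference_answers)


# Made-up answers for four problems; three match the references.
print(percent_solved(["1", "2", "3", "4"], ["1", "2", "0", "4"]))  # 75.0
```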

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MATH (representative subset)	% Solved (Best-of-1860)	72.4 (ORM)	78.2	+5.8
MATH (representative subset)	% Solved (Best-of-1860)	69.6 (Majority Voting)	78.2	+8.6
Aggregate STEM OOD	% Problems Solved (Best-of-100)	63.8 (ORM)	72.9	+9.1
Aggregate STEM OOD	% Problems Solved (Best-of-100)	61.3 (Majority Voting)	72.9	+11.6
MATH (small-scale ablation)	Data Efficiency Multiplier	1.0 (uniform labeling)	2.6 (active learning)	+1.6

Experiment Figures

Performance comparison (% Problems Solved) of ORM, PRM, and Majority Voting as a function of the number of solutions N (log scale).

Small-scale ablation comparing PRM vs ORM (trained with different supervision sources) across varying training set sizes.

Main Takeaways

Process supervision (PRM) reliably outperforms outcome supervision (ORM) and majority voting, with the gap widening as the number of search samples (N) increases.
The 'alignment tax' is negative; safer, more interpretable process supervision actually yields higher performance than outcome supervision.
Active learning (surfacing convincing wrong-answer solutions to labelers) drastically improves data efficiency compared to uniform sampling.
Using the large-scale PRM as a stand-in labeler for smaller reward models (synthetic supervision) makes controlled small-scale comparisons possible without collecting additional human labels.

📚 Prerequisite Knowledge

Prerequisites

Language Model fine-tuning
Reinforcement Learning from Human Feedback (RLHF) concepts
Best-of-N search (rejection sampling)

Key Terms

PRM: Process-supervised Reward Model—a model trained to predict the correctness of each intermediate step in a solution

ORM: Outcome-supervised Reward Model—a model trained to predict the correctness of the final result of a solution

MathMix: A pretraining dataset of roughly 1.5B math-relevant tokens used to improve the base model's mathematical reasoning

Active Learning: A data collection strategy where the model selects the most informative examples (here, convincing wrong-answer solutions) for human labeling
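The selection rule behind this strategy can be sketched as: keep only solutions whose final answer is wrong, then surface the ones the current PRM rates most highly. The `Candidate` type, field names, and cutoff `k` below are illustrative assumptions rather than the paper's exact procedure.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    prm_score: float            # current PRM's score for the whole solution
    final_answer_correct: bool  # result of automatic final-answer checking


def convincing_wrong_answers(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Pick the k highest-scoring solutions that reach a wrong final answer."""
    wrong = [c for c in candidates if not c.final_answer_correct]
    return sorted(wrong, key=lambda c: c.prm_score, reverse=True)[:k]


# Made-up pool of sampled solutions for one problem.
pool = [
    Candidate("s1", 0.92, True),
    Candidate("s2", 0.88, False),
    Candidate("s3", 0.15, False),
    Candidate("s4", 0.61, False),
]
print([c.text for c in convincing_wrong_answers(pool, k=2)])  # ['s2', 's4']
```

These are the cases where the PRM is known to be fooled somewhere, so labeling them is more likely to yield informative negative step labels than labeling a uniform sample.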

Best-of-N search: An inference strategy where N solutions are generated, ranked by a reward model, and the highest-ranked solution is selected

Credit Assignment: The problem of determining which past action or step is responsible for a final outcome (success or failure)

Chain-of-thought: A prompting/generation method where the model produces intermediate reasoning steps before the final answer