Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

📝 Paper Summary

Mathematical Reasoning, Process Reward Models (PRM), Reinforcement Learning from Feedback

MATH-SHEPHERD automates the training of process reward models by verifying intermediate reasoning steps through Monte Carlo-style rollouts, eliminating the need for expensive human annotations.

Core Problem

Process Reward Models (PRMs) improve reasoning reliability but traditionally require costly and hard-to-scale human annotations to label individual steps as correct or incorrect.

Why it matters:

Human annotation for complex multi-step math problems requires high skill levels and is prohibitively expensive to scale
Outcome Reward Models (ORMs) only grade the final answer, failing to identify specific errors in the reasoning chain, which limits feedback quality for reinforcement learning
Relying solely on top-1 generation from LLMs is unreliable; effective verification is needed to select correct solutions from candidates

Concrete Example: In a polynomial problem requiring $p(0) + p(4)$, an ORM might mark a solution with the wrong answer '20' as incorrect (0 score). However, the solution might have a correct first step (factoring the polynomial correctly) followed by a calculation error. An ORM misses this nuance, whereas a PRM can reward the correct first step while penalizing the subsequent error.

Key Novelty

Automatic Process Annotation via Reasoning Rollouts

Defines the quality of an intermediate step by its potential to reach the correct final answer, inspired by Monte Carlo Tree Search
Uses a 'completer' model to generate multiple future paths from a specific step; if these paths lead to the ground truth answer, the step is automatically labeled as valid
Constructs a massive supervised dataset of step-wise labels without human intervention to train a Process Reward Model

Architecture

Comparison between automatic Outcome Annotation and the proposed Automatic Process Annotation pipeline

Evaluation Highlights

DeepSeek-67B with MATH-SHEPHERD verification achieves 93.3% accuracy on GSM8K (+5.1 percentage points over Self-Consistency)
Mistral-7B trained with step-by-step PPO using MATH-SHEPHERD improves from 28.6% to 33.0% on the MATH dataset
LLaMA2-70B with MATH-SHEPHERD verification achieves 44.5% on MATH500, outperforming Outcome Reward Models (40.4%) and Self-Consistency (39.4%)

Breakthrough Assessment

8/10

Significantly lowers the barrier for training Process Reward Models by removing the human annotation bottleneck. Reports state-of-the-art results among open-source models without relying on external tools.

⚙️ Technical Details

Problem Definition

Setting: Step-by-step mathematical reasoning where each step $s_i$ in a solution $S$ needs a quality score

Inputs: Math problem $p$, partial solution steps $s_{1...i}$

Outputs: Reward score $r_{s_i} \in [0, 1]$ indicating the correctness/potential of the step

Pipeline Flow

Step Selection: Identify a specific reasoning step $s_i$ in a solution
Completion (Rollout): Use a 'Completer' LLM to generate $N$ full solution paths that continue from the partial solution $s_{1...i}$
Automatic Annotation: Compare final answers of rollouts to the Ground Truth. Assign label to $s_i$ based on success rate (Hard or Soft Estimation)
PRM Training: Train a verifier to predict these labels
Deployment: Use PRM for Verification (Reranking) or Reinforcement Learning (Step-by-step PPO)
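
The annotation stage of the pipeline above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `annotate_steps`, `toy_completer`, and the example steps are hypothetical names, and the completer is assumed to be a callable that returns only the final answer of one rollout. Hard Estimation labels a step 1 if any rollout reaches the ground truth; Soft Estimation uses the fraction of rollouts that do.

```python
import itertools
from typing import Callable, List, Tuple


def annotate_steps(
    problem: str,
    steps: List[str],
    gold_answer: str,
    complete: Callable[[str, List[str]], str],
    n_rollouts: int = 8,
) -> List[Tuple[int, float]]:
    """Return a (hard_label, soft_label) pair for every step of one solution."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]  # partial solution s_1..s_i
        # Roll out n completions from the prefix and count correct final answers.
        hits = sum(
            complete(problem, prefix) == gold_answer for _ in range(n_rollouts)
        )
        hard = 1 if hits > 0 else 0  # Hard Estimation: any rollout succeeds
        soft = hits / n_rollouts     # Soft Estimation: success rate
        labels.append((hard, soft))
    return labels


# Hypothetical stand-in for a completer LLM (deterministic for illustration):
# a prefix containing a mistake never recovers; otherwise every 4th rollout fails.
_calls = itertools.count()


def toy_completer(problem: str, prefix: List[str]) -> str:
    if any("mistake" in step for step in prefix):
        return "20"
    return "7" if next(_calls) % 4 == 0 else "10"


labels = annotate_steps(
    problem="Find p(0) + p(4).",
    steps=["Factor the polynomial", "Arithmetic mistake: 6 + 12 = 20"],
    gold_answer="10",
    complete=toy_completer,
)
print(labels)  # [(1, 0.75), (0, 0.0)]
```

In a real setup the completer would be an LLM sampled with a nonzero temperature, so the cost grows with the number of steps times the number of rollouts.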

System Modules

Completer

Generates future reasoning paths from intermediate steps

Model or implementation: LLemma-7B (fine-tuned on MetaMath)

Process Reward Model (MATH-SHEPHERD)

Assigns a scalar score to each reasoning step

Model or implementation: Based on LLaMA2-70B, LLemma-34B, or Mistral-7B

Generator

Solves math problems

Model or implementation: Mistral-7B, DeepSeek-67B, LLaMA2-70B

Novel Architectural Elements

Automatic Process Annotation pipeline: Coupling a 'Completer' model with Ground Truth validation to synthesize step-level labels without humans
Step-by-step PPO integration: Applying PPO updates using dense rewards at every newline/step rather than sparse terminal rewards
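
To make the contrast between dense and sparse rewards concrete, the sketch below builds a per-token reward vector in both styles. It assumes that steps are delimited by a newline token and that the PRM score for a step is placed on the step's last token; both choices, and the names `stepwise_rewards` and `outcome_reward`, are illustrative assumptions rather than details confirmed by the source.

```python
from typing import List


def stepwise_rewards(
    tokens: List[str], step_scores: List[float], sep: str = "\n"
) -> List[float]:
    """Dense rewards: one PRM score at the end of each step, 0 elsewhere."""
    rewards = [0.0] * len(tokens)
    step = 0
    for t, tok in enumerate(tokens):
        ends_step = tok == sep or t == len(tokens) - 1
        if ends_step:
            if step < len(step_scores):
                rewards[t] = step_scores[step]
            step += 1
    return rewards


def outcome_reward(tokens: List[str], final_score: float) -> List[float]:
    """Sparse reward: a single score on the last token of the solution."""
    rewards = [0.0] * len(tokens)
    if tokens:
        rewards[-1] = final_score
    return rewards


tokens = ["step1", "\n", "step2", "\n", "answer"]
print(stepwise_rewards(tokens, [0.9, 0.8, 0.1]))  # [0.0, 0.9, 0.0, 0.8, 0.1]
print(outcome_reward(tokens, 0.0))                # [0.0, 0.0, 0.0, 0.0, 0.0]
```

With dense rewards the policy receives credit for the early correct steps even when the final answer is wrong, which the sparse variant cannot express.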

Modeling

Base Model: Evaluated on LLaMA2-70B, LLemma-34B, Mistral-7B, DeepSeek-67B

Training Method: Step-by-step Proximal Policy Optimization (PPO)

Objective Functions:

Purpose: Train the PRM to predict step correctness.

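Formally: $L_{PRM} = -\sum_{i} [y_{s_i} \log r_{s_i} + (1-y_{s_i}) \log (1-r_{s_i})]$ (Binary Cross Entropy, summed over the steps $s_i$ of a solution, where $y_{s_i}$ is the automatically assigned label and $r_{s_i}$ is the PRM's predicted score)

Purpose: Reinforce the generator using PRM scores.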

Formally: Standard PPO objective where reward $r_t$ is provided at each step $t$ by the PRM
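
The PRM training objective listed above (binary cross-entropy over step labels) can be computed for one solution as in the sketch below. It is written as a quantity to minimize, i.e. the negated log-likelihood, and works with either hard (0/1) or soft (fractional) labels. The function name `prm_bce_loss` and the clamping constant `eps` are assumptions made for this illustration.

```python
import math
from typing import Sequence


def prm_bce_loss(
    labels: Sequence[float], scores: Sequence[float], eps: float = 1e-12
) -> float:
    """Binary cross-entropy summed over the steps of one solution.

    labels: step labels y in [0, 1] from automatic annotation.
    scores: PRM predictions r in [0, 1] for the same steps.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total = 0.0
    for y, r in zip(labels, scores):
        r = min(max(r, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(r) + (1.0 - y) * math.log(1.0 - r)
    return total


# A PRM that agrees with the labels incurs a lower loss than one that disagrees.
good = prm_bce_loss([1, 0], [0.9, 0.1])
bad = prm_bce_loss([1, 0], [0.1, 0.9])
print(good < bad)  # True
```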

Training Data:

Sampled 15 solutions per problem from 7B/13B models on GSM8K/MATH train sets
Used LLemma-7B as completer with N=8 rollouts
Generated ~170k solutions for GSM8K and ~270k for MATH

Key Hyperparameters:

learning_rate_prm: 1e-6
learning_rate_ppo_mistral: 1e-7
kl_coefficient: 0.04
prm_epochs: 1
max_seq_len: 512

Compute: Used 3D parallelism provided by hfai. Verification uses 256 candidate samples.

Comparison to Prior Work

vs. ORM: Provides dense step-level feedback, enabling more granular verification and RL
vs. PRM800K: Uses automatically synthesized labels instead of expensive human annotations; dataset is 4x larger
vs. DIVERSE (Li et al., 2023b): Uses rollout-based estimation (Completer) rather than NLI or rule-based heuristics to determine step correctness

Limitations

Computational cost of the 'completion' process during data construction (requires decoding N paths for every step)
Automatic annotations contain noise (false positives/negatives) compared to gold human labels
Reward models may not generalize perfectly to distributions significantly different from the base model used for completion

Reproducibility

The paper mentions a MATH-SHEPHERD project page, but the text provides no URL. Base models (Mistral, LLaMA2, LLemma, DeepSeek) are open source. Training relies on the MetaMath dataset.

📊 Experiments & Results

Evaluation Setup

Math problem solving with step-by-step reasoning

Benchmarks:

GSM8K (Grade school math word problems)
MATH (Competition-level mathematics)
MATH500 (Subset of 500 MATH test problems for verification evaluation)

Metrics:

Accuracy (Pass@1)
Best-of-N Accuracy (Verification score)
Statistical methodology: Not explicitly reported in the paper
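
As an illustration of how Best-of-N verification differs from Self-Consistency, the sketch below selects an answer from the same candidate pool in both ways. It assumes that a solution's score is the minimum of its per-step PRM scores, which is one common aggregation and is labeled here as an assumption; the names `Candidate`, `solution_score`, `best_of_n`, and `majority_vote`, and the scores themselves, are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# One candidate = (final answer, per-step PRM scores); scores are made up.
Candidate = Tuple[str, List[float]]


def solution_score(step_scores: List[float]) -> float:
    """Aggregate step scores into one solution score (assumption: minimum)."""
    return min(step_scores)


def best_of_n(candidates: List[Candidate]) -> str:
    """Return the answer of the candidate with the highest solution score."""
    return max(candidates, key=lambda c: solution_score(c[1]))[0]


def majority_vote(candidates: List[Candidate]) -> str:
    """Self-Consistency: return the most frequent final answer."""
    return Counter(answer for answer, _ in candidates).most_common(1)[0][0]


candidates: List[Candidate] = [
    ("20", [0.9, 0.2]),   # good first step, weak second step
    ("20", [0.8, 0.3]),
    ("10", [0.9, 0.85]),  # every step scored highly
]
print(majority_vote(candidates))  # 20
print(best_of_n(candidates))      # 10
```

Best-of-N accuracy is then the fraction of problems for which the selected answer matches the ground truth.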

Key Results

Verification (Reranking) Performance: MATH-SHEPHERD consistently outperforms baselines across varying model sizes and datasets.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy % (Best-of-256)	88.2 (Self-Consistency)	93.3	+5.1
MATH500	Accuracy % (Best-of-256)	40.4 (ORM)	44.5	+4.1
MATH500	Accuracy % (Best-of-256)	39.4 (Self-Consistency)	44.5	+5.1

Reinforcement Learning (PPO) Performance: Step-by-step PPO with MATH-SHEPHERD improves base model performance significantly.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy % (Greedy)	77.9	84.1	+6.2
MATH	Accuracy % (Greedy)	28.6	33.0	+4.4
MATH	Accuracy % (Greedy)	31.3	33.0	+1.7

Experiment Figures

Verification performance (Best-of-N) scaling with number of solutions N on GSM8K and MATH

Main Takeaways

Process Reward Models (PRMs) trained on automatically generated data outperform Outcome Reward Models (ORMs) and Self-Consistency, especially on difficult tasks like MATH
Step-by-step reinforcement learning (PPO) using PRMs provides a stronger training signal than outcome-based PPO, leading to significant accuracy gains
The 'Completer' quality matters: larger models used for automatic data annotation lead to better PRMs
The method generalizes well, showing improvements across model families (LLaMA, Mistral, DeepSeek) and scales effectively with the number of candidate solutions (Best-of-N)

📚 Prerequisite Knowledge

Prerequisites

Process Reward Model (PRM) vs Outcome Reward Model (ORM)
Proximal Policy Optimization (PPO)
Monte Carlo Tree Search (MCTS) concepts (rollouts)

Key Terms

PRM: Process Reward Model—a model that scores each individual step of a reasoning chain rather than just the final answer

ORM: Outcome Reward Model—a model that scores the entire generated solution based on whether the final answer is correct

PPO: Proximal Policy Optimization—a reinforcement learning algorithm used to finetune LLMs, here applied step-by-step using PRM rewards

Hard Estimation: A binary labeling strategy where a step is labeled '1' if *any* generated completion leads to the correct answer, and '0' otherwise

Soft Estimation: A continuous labeling strategy where a step's label is the *fraction* of generated completions that reach the correct answer

RFT: Rejection Sampling Fine-Tuning, a method where the model is fine-tuned on its own sampled solutions that reach the correct answer

GSM8K: Grade School Math 8K—a benchmark dataset of grade-school level math word problems

MATH: Mathematics Dataset—a challenging dataset of competition-level math problems

Self-Consistency: A verification method that samples multiple reasoning paths and selects the answer that appears most frequently (majority voting)

Completer: A language model used to generate full reasoning paths starting from a specific intermediate step to check if that step can lead to the correct answer