Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Amrith Rajagopal Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar
Google Research, Google DeepMind, Carnegie Mellon University
International Conference on Learning Representations (2024)
RL Reasoning

📝 Paper Summary

Process Reward Models (PRMs) Reinforcement Learning for Reasoning Test-time Compute Scaling
Process Advantage Verifiers (PAVs) improve reasoning by rewarding each step's 'progress' (its advantage measured under a separate prover policy) rather than its absolute correctness under the base policy.
Core Problem
Outcome Reward Models (ORMs) provide sparse feedback that is inefficient for search and learning, while standard automated Process Reward Models (PRMs) built from the base policy's value function fail to distinguish good steps from merely promising states.
Why it matters:
  • Sparse outcome signals make RL sample-inefficient and fail to guide exploration in complex multi-step reasoning tasks
  • Using the base policy's own Q-values as rewards is redundant for RL updates (equivalent to outcome rewards) and fails to incentivize exploration of novel correct paths
  • Standard Q-value search is inefficient because it conflates the quality of a specific action with the high value of the state it came from
Concrete Example: In a math problem, a strong base policy might assign high Q-values to both a correct step and a trivial 'rephrasing' step because it can solve the problem from either. A Q-value based search would keep both. A PAV using a complementary prover would assign a high 'advantage' only to the step that actually increases success probability, pruning the trivial one.
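The example above can be made concrete with a toy sketch. All names and numbers here are illustrative, not from the paper: the process reward is the advantage A_mu(s, a) = Q_mu(s, a) - V_mu(s) under a prover policy mu, so the correct step earns a positive advantage while the trivial rephrasing step earns zero and would be pruned.

```python
# Toy sketch (illustrative numbers, not the paper's implementation):
# the process reward for a step is its advantage under a prover policy mu,
# i.e. the change in the prover's success probability caused by the step.

# Hypothetical prover success-probability estimates.
V_mu = {"s0": 0.50}                       # prover's success prob. at the current state
Q_mu = {                                  # prover's success prob. after each candidate step
    ("s0", "correct_step"): 0.80,
    ("s0", "rephrase_step"): 0.50,        # trivial rephrasing makes no progress
}

def advantage(state: str, step: str) -> float:
    """Process reward = progress the step makes under the prover: Q_mu - V_mu."""
    return Q_mu[(state, step)] - V_mu[state]

print(round(advantage("s0", "correct_step"), 2))   # positive -> step is kept
print(round(advantage("s0", "rephrase_step"), 2))  # zero -> step is pruned
```

Because the advantage subtracts out the state's value, a confident base policy no longer inflates the score of steps that merely restate the problem.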
Key Novelty
Process Advantage Verifiers (PAVs) with Complementary Provers
  • Define process rewards as the 'advantage' (change in success probability) of a step, rather than the absolute value of the resulting state
  • Compute these advantages using a 'prover policy' different from the base policy (e.g., a Best-of-K policy), ensuring the signal distinguishes step quality even when the base policy is confident
  • Use these advantage scores as dense rewards for both test-time beam search and online Reinforcement Learning (RL)
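As a rough illustration of the last point, the sketch below shows a generic step-level beam search that ranks candidate steps by a per-step PAV score. `score_step` is a hypothetical stand-in for the trained process verifier, and the search loop is an assumption of this sketch rather than the paper's exact procedure.

```python
import heapq

def beam_search(init_states, expand, score_step, width=4, depth=3):
    """Keep the `width` highest-scoring partial solutions at each depth.

    expand(state)        -> list of candidate next steps (strings)
    score_step(state, a) -> PAV score (prover advantage) for appending step `a`
    """
    beam = [(0.0, s) for s in init_states]          # (cumulative score, partial solution)
    for _ in range(depth):
        candidates = []
        for total, state in beam:
            for step in expand(state):
                candidates.append((total + score_step(state, step),
                                   state + " | " + step))
        # Prune to the top-`width` candidates by cumulative PAV score.
        beam = heapq.nlargest(width, candidates)
    return beam
```

Ranking by per-step advantages rather than an absolute outcome or value score is what lets the search discard high-value states reached through uninformative steps.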
Evaluation Highlights
  • Beam search with PAVs is >8% more accurate and 1.5-5x more compute-efficient than ORM baselines on MATH using Gemma models
  • Online RL with PAV dense rewards is 6x more sample-efficient than ORM-RL to reach the same accuracy
  • PAV-RL improves accuracy by >6% over ORM-RL baselines on Gemma-2B and 9B models
Breakthrough Assessment
8/10
Significant efficiency gains in both inference search and RL training. Theoretical characterization of 'good provers' provides a new principled direction for PRM design beyond just 'better labels'.