Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Amrith Rajagopal Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar
Google Research, Google DeepMind, Carnegie Mellon University
International Conference on Learning Representations (2024)
RL Reasoning

📝 Paper Summary

Process Reward Models (PRMs) Reinforcement Learning for Reasoning Test-time Compute Scaling
Process Advantage Verifiers (PAVs) improve reasoning by rewarding each step's 'progress' (its advantage measured under a separate prover policy) rather than its absolute correctness under the base policy.
Core Problem
Outcome Reward Models (ORMs) provide sparse feedback that is inefficient for search and learning, while standard automated Process Reward Models (PRMs) built from the base policy's value function fail to distinguish good steps from merely promising states.
Why it matters:
  • Sparse outcome signals make RL sample-inefficient and fail to guide exploration in complex multi-step reasoning tasks
  • Using the base policy's own Q-values as rewards is redundant for RL updates (equivalent to outcome rewards) and fails to incentivize exploration of novel correct paths
  • Standard Q-value search is inefficient because it conflates the quality of a specific action with the high value of the state it came from
Concrete Example: In a math problem, a strong base policy might assign high Q-values to both a correct step and a trivial 'rephrasing' step because it can solve the problem from either. A Q-value based search would keep both. A PAV using a complementary prover would assign a high 'advantage' only to the step that actually increases success probability, pruning the trivial one.
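The example above can be made concrete with a toy sketch. All names and numbers here are illustrative, not from the paper: the process reward is the advantage A_mu(s, a) = Q_mu(s, a) - V_mu(s) under a prover policy mu, so the correct step earns a positive advantage while the trivial rephrasing step earns zero and would be pruned.

```python
# Toy sketch (illustrative numbers, not the paper's implementation):
# the process reward for a step is its advantage under a prover policy mu,
# i.e. the change in the prover's success probability caused by the step.

# Hypothetical prover success-probability estimates.
V_mu = {"s0": 0.50}                       # prover's success prob. at the current state
Q_mu = {                                  # prover's success prob. after each candidate step
    ("s0", "correct_step"): 0.80,
    ("s0", "rephrase_step"): 0.50,        # trivial rephrasing makes no progress
}

def advantage(state: str, step: str) -> float:
    """Process reward = progress the step makes under the prover: Q_mu - V_mu."""
    return Q_mu[(state, step)] - V_mu[state]

print(round(advantage("s0", "correct_step"), 2))   # positive -> step is kept
print(round(advantage("s0", "rephrase_step"), 2))  # zero -> step is pruned
```

Because the advantage subtracts out the state's value, a confident base policy no longer inflates the score of steps that merely restate the problem.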
Key Novelty
Process Advantage Verifiers (PAVs) with Complementary Provers
  • Define process rewards as the 'advantage' (change in success probability) of a step, rather than the absolute value of the resulting state
  • Compute these advantages using a 'prover policy' different from the base policy (e.g., a Best-of-K policy), ensuring the signal distinguishes step quality even when the base policy is confident
  • Use these advantage scores as dense rewards for both test-time beam search and online Reinforcement Learning (RL)
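As a rough illustration of the last point, the sketch below shows a generic step-level beam search that ranks candidate steps by a per-step PAV score. `score_step` is a hypothetical stand-in for the trained process verifier, and the search loop is an assumption of this sketch rather than the paper's exact procedure.

```python
import heapq

def beam_search(init_states, expand, score_step, width=4, depth=3):
    """Keep the `width` highest-scoring partial solutions at each depth.

    expand(state)        -> list of candidate next steps (strings)
    score_step(state, a) -> PAV score (prover advantage) for appending step `a`
    """
    beam = [(0.0, s) for s in init_states]          # (cumulative score, partial solution)
    for _ in range(depth):
        candidates = []
        for total, state in beam:
            for step in expand(state):
                candidates.append((total + score_step(state, step),
                                   state + " | " + step))
        # Prune to the top-`width` candidates by cumulative PAV score.
        beam = heapq.nlargest(width, candidates)
    return beam
```

Ranking by per-step advantages rather than an absolute outcome or value score is what lets the search discard high-value states reached through uninformative steps.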
Evaluation Highlights
  • Beam search with PAVs is >8% more accurate and 1.5-5x more compute-efficient than ORM baselines on MATH using Gemma models
  • Online RL with PAV dense rewards is 6x more sample-efficient than ORM-RL to reach the same accuracy
  • PAV-RL improves accuracy by >6% over ORM-RL baselines on Gemma-2B and 9B models
Breakthrough Assessment
8/10
Significant efficiency gains in both inference search and RL training. Theoretical characterization of 'good provers' provides a new principled direction for PRM design beyond just 'better labels'.