
The Lessons of Developing Process Reward Models in Mathematical Reasoning

Zhenru Zhang, Chujie Zheng, Yang Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
Qwen Team, Alibaba Group
Annual Meeting of the Association for Computational Linguistics (2025)
Tags: RL · Reasoning · Benchmark

📝 Paper Summary

Topic: Process Reward Models (PRMs) for Mathematical Reasoning
This paper identifies critical flaws in Monte Carlo-based data synthesis and Best-of-N evaluation for math verifiers, proposing a consensus filtering approach that significantly improves step-wise error detection.
Core Problem
Common Monte Carlo (MC) data synthesis for Process Reward Models is noisy because models often arrive at correct answers via incorrect steps, while Best-of-N (BoN) evaluation masks these process failures by rewarding outcome-only correctness.
Why it matters:
  • Monte Carlo estimation generates false positives (correct answer from wrong steps) and false negatives, creating noisy training data that hurts generalization
  • Optimizing solely for BoN scores causes PRMs to degenerate into Outcome Reward Models (ORMs), where scores concentrate on the final step rather than verifying intermediate logic
  • Existing PRMs fail to detect specific reasoning errors, undermining the reliability of AI mathematical reasoning even when final answers are correct
Concrete Example: A policy model might generate a solution where '2 + 2 = 5' appears in an intermediate step, but subsequent errors cancel it out to reach the correct final answer. Monte Carlo estimation would label '2 + 2 = 5' as correct because the final answer matched, confusing the PRM during training.
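The false-positive failure mode above can be made concrete with a toy sketch of MC step labeling. This is an illustrative simplification, not the paper's implementation: `buggy_policy` is a hypothetical stand-in for sampling completions from a policy model, and the labeling rule (fraction of rollouts reaching the gold answer) follows the standard MC-estimation recipe.

```python
import random

def mc_label_step(complete_fn, prefix, gold_answer, n_rollouts=8):
    """Monte Carlo label: score a step prefix by how often completions
    sampled from it reach the gold final answer (a soft score in [0, 1])."""
    hits = sum(complete_fn(prefix) == gold_answer for _ in range(n_rollouts))
    return hits / n_rollouts

def buggy_policy(prefix):
    """Toy policy: after the wrong step '2 + 2 = 5', a later cancelling
    error still lands on the gold answer 70% of the time."""
    if "2 + 2 = 5" in prefix:
        return "4" if random.random() < 0.7 else "5"
    return "4"

random.seed(0)
score = mc_label_step(buggy_policy, "... 2 + 2 = 5 ...", gold_answer="4")
# A high score here falsely marks the incorrect step as correct,
# injecting exactly the kind of noise the paper identifies.
```

With these toy numbers the wrong step receives a majority-positive score, which is the false positive that corrupts MC-synthesized training data.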
Key Novelty
Consensus Filtering Mechanism
  • Integrates Monte Carlo estimation with LLM-as-a-judge by retaining training instances only when both methods agree on the correctness of reasoning steps
  • Demonstrates that strict hard-label training (binary correct/incorrect) derived from this consensus outperforms soft-label probability training
  • Advocates for dual evaluation using both response-level Best-of-N and step-level ProcessBench to prevent PRMs from ignoring intermediate process quality
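The consensus idea in the first bullet can be sketched in a few lines. This is a minimal illustration under assumed data shapes (per-step dicts with an MC score and an LLM-judge verdict; field names are hypothetical), not the paper's pipeline: an instance is kept only when both annotators agree on every step, and the agreed verdict becomes a hard binary label.

```python
def consensus_filter(instances):
    """Keep an instance only when MC estimation and the LLM judge agree
    on every step; emit hard 0/1 labels from the consensus.
    Each instance is a list of steps shaped like
      {"mc_score": float in [0, 1], "judge_correct": bool}  (illustrative)."""
    kept = []
    for steps in instances:
        labeled = []
        for step in steps:
            mc_correct = step["mc_score"] > 0.0  # MC: any rollout succeeded
            if mc_correct != step["judge_correct"]:
                labeled = None  # disagreement: discard the whole instance
                break
            labeled.append({**step, "label": int(mc_correct)})
        if labeled is not None:
            kept.append(labeled)
    return kept

data = [
    # both methods agree on both steps -> kept, with hard labels [1, 0]
    [{"mc_score": 0.6, "judge_correct": True},
     {"mc_score": 0.0, "judge_correct": False}],
    # MC says correct, judge says incorrect -> dropped as noisy
    [{"mc_score": 0.6, "judge_correct": False}],
]
clean = consensus_filter(data)
```

Discarding disagreements is what trades raw data volume for label quality, which is consistent with the paper's finding that the filtered set trains better despite being smaller.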
Evaluation Highlights
  • Qwen2.5-Math-PRM-7B achieves 73.5% Mean F1 on ProcessBench, outperforming the Math-Shepherd-PRM-7B baseline (31.5%) by +42.0 points
  • In Best-of-8 evaluation, Qwen2.5-Math-PRM-7B achieves 67.6% average accuracy, surpassing Math-Shepherd-PRM-7B (64.2%) and PRMs trained on human-annotated PRM800K data
  • Consensus filtering matches the performance of full LLM-as-a-judge training while using only ~40% of the data, demonstrating significantly improved data efficiency
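For context on the Best-of-8 numbers, Best-of-N selection with a PRM can be sketched as follows. The candidates, scores, and product aggregation are illustrative assumptions (min over steps is another common choice); the key point is that BoN grades only the selected candidate's final answer, which is why it can mask bad intermediate steps.

```python
import math

def bon_select(candidates):
    """Best-of-N: candidates is a list of (final_answer, per_step_prm_scores).
    Aggregate each candidate's step scores by their product and return the
    final answer of the top-scoring candidate."""
    return max(candidates, key=lambda c: math.prod(c[1]))[0]

cands = [
    ("42", [0.9, 0.2, 0.95]),  # one weak middle step tanks the product
    ("41", [0.8, 0.8, 0.8]),
    ("42", [0.6, 0.6, 0.6]),
]
best = bon_select(cands)  # a true PRM penalizes the flawed-step candidate
```

Note that an ORM-degenerate PRM, whose scores concentrate on the final step, would still rank candidates plausibly under this metric, which is exactly why the paper pairs BoN with step-level ProcessBench evaluation.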
Breakthrough Assessment
9/10
Identifies fundamental flaws in standard PRM training (MC estimation) and evaluation (BoN bias). The proposed consensus filtering yields massive gains in step-wise verification accuracy (+42.0 F1 points) over existing baselines.