
Process Reward Model with Q-Value Rankings

Wendi Li, Yixuan Li
Department of Computer Science, Huazhong University of Science and Technology; Department of Computer Sciences, University of Wisconsin-Madison
International Conference on Learning Representations (2024)
RL Reasoning

📝 Paper Summary

Process Reward Modeling (PRM) · Mathematical Reasoning
PQM reformulates process reward modeling as a Q-value ranking problem within a Markov Decision Process to capture step interdependencies, outperforming classification-based methods that treat steps in isolation.
Core Problem
Existing Process Reward Models (PRMs) typically use binary cross-entropy loss to classify each reasoning step independently, ignoring the sequential dependencies and relative importance of steps within a trajectory.
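The classification-style objective described above can be sketched in a few lines. This is a toy illustration of the baseline, not the paper's implementation; the logits and labels are made-up values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_prm_loss(step_logits, step_labels):
    """Classification-based PRM objective: each reasoning step is scored
    with an independent binary cross-entropy term, so the loss ignores
    step ordering and interdependence within the trajectory."""
    losses = []
    for z, y in zip(step_logits, step_labels):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

# Toy 4-step trajectory: one logit per step, binary correctness labels
logits = [1.2, 0.3, -0.8, 2.1]
labels = [1, 1, 0, 1]
loss = bce_prm_loss(logits, labels)
```

Because every term is independent, permuting the steps leaves the loss unchanged, which is exactly the limitation the paper targets.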
Why it matters:
  • Independent classification leads to suboptimal reward distribution because it fails to capture how earlier steps influence the validity of later ones
  • Current methods lack theoretical grounding for how their scoring approximates the true probability of success
  • In complex reasoning (e.g., math), a single misstep can invalidate the entire subsequent chain, a nuance missed by independent step classifiers
Concrete Example: In a math problem, a classification-based PRM might score a trivial correct step the same as a crucial breakthrough step. PQM, in contrast, scores each step by its contribution to the probability of final success, reflecting the expectation that Q-values should ascend as a correct solution progresses.
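The Q-value in this example, a step's score as the probability of eventually reaching the correct answer, is commonly estimated by Monte-Carlo rollouts from the partial solution. A minimal sketch, where `rollout_fn` is a hypothetical sampler (not from the paper) that completes a partial solution and reports whether the final answer was correct:

```python
def estimate_q(rollout_fn, state, n_rollouts=64):
    """Estimate Q(state) as the empirical success rate of continuations.

    rollout_fn(state) -> bool  (hypothetical: samples one completion of the
    partial solution `state` and returns True if it reaches the right answer)
    """
    successes = sum(rollout_fn(state) for _ in range(n_rollouts))
    return successes / n_rollouts
```

Under this estimator, a state reached via a fatal misstep has near-zero Q, since almost no continuation recovers, which is how a single error invalidates the rest of the chain.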
Key Novelty
Process Q-value Model (PQM)
  • Frames the reasoning process as a Markov Decision Process (MDP) where the reward for a step is its Q-value (probability of reaching the correct answer from that state)
  • Derives theoretical results showing that Q-values should ascend along correct step sequences and descend along incorrect ones, with a distinct gap between the two
  • Optimizes the model using a comparative ranking loss rather than independent binary classification to better approximate these theoretical dynamics
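A comparative ranking objective in this spirit can be sketched as a pairwise hinge loss that pushes every correct-step Q-value above every incorrect-step Q-value by a margin. This is an illustrative stand-in, not the paper's exact loss (whose form is not given in this summary):

```python
def pairwise_ranking_loss(q_correct, q_wrong, margin=1.0):
    """Hinge-style comparative loss: penalize any pair where a
    correct step's Q-value fails to exceed a wrong step's Q-value
    by at least `margin`. Averaged over all pairs."""
    loss, pairs = 0.0, 0
    for qc in q_correct:
        for qw in q_wrong:
            loss += max(0.0, margin - (qc - qw))
            pairs += 1
    return loss / pairs

# Well-separated Q-values incur no loss; a narrow gap is penalized
separated = pairwise_ranking_loss([2.0, 3.0], [0.0])   # gap >= margin
narrow = pairwise_ranking_loss([0.5], [0.0])           # gap < margin
```

Unlike independent BCE, this loss depends only on relative orderings of Q-values across steps, which is the key shift PQM makes.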
Evaluation Highlights
  • +11.6% improvement in verification accuracy on the MATH500 benchmark compared to classification-based PRMs when verifying Llama-3-70B-Instruct solutions
  • Validates theoretical proofs showing Q-values ascend for correct trajectories and descend for incorrect ones (visualized in analysis)
Breakthrough Assessment
7/10
Provides strong theoretical grounding (MDP formulation) for an empirically heuristic field (PRMs). Significant quantitative gains on MATH500, though the paper snippet limits assessment of broader generalization.