← Back to Paper List

Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

Jie Chen, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Gang Xiong, Yisheng Lv, Fei-Yue Wang
Institute of Automation, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai Artificial Intelligence Laboratory, Tencent
arXiv.org (2025)
RL Reasoning Benchmark

📝 Paper Summary

Reinforcement Learning for Reasoning Process Reward Models (PRMs)
PURE replaces the standard summation of future rewards in reinforcement learning with a minimum-reward formulation, aligning training with test-time usage to prevent process reward hacking in reasoning tasks.
Core Problem
Standard reinforcement learning sums discounted future rewards, which induces models to 'hack' Process Reward Models (PRMs) by generating long sequences of high-reward 'thinking' steps that fail to solve the actual problem.
Why it matters:
  • Process Reward Models (PRMs) are effective for test-time scaling but difficult to use for training due to reward hacking issues
  • Canonical summation-form credit assignment causes training collapse, where the model optimizes for high interim scores rather than correct final answers
  • Sparse verifiable rewards (checking only the final answer) are inefficient for long-context reasoning tasks
Concrete Example: When using summation-form credit assignment, the model learns to output only 'thinking' steps (which get high process rewards) without ever generating a solution, causing the training to collapse efficiently (e.g., benchmark scores dropping to ~30 at step 80).
Key Novelty
PURE (Process sUpervised Reinforcement lEarning)
  • Redefines the value of a reasoning trajectory as the *minimum* future process reward rather than the *sum* of discounted future rewards, aligning training with how PRMs are used during inference (Best-of-N)
  • Limits the value function's range to the reward model's range, preventing the accumulation of errors that occurs when summing multiple step rewards
Architecture
Architecture Figure Figure 1
Visualization of sampling probability changes during training for both Sum-form and Min-form credit assignment on a specific math problem
Evaluation Highlights
  • Achieves 53.3% average accuracy across 5 math benchmarks with Qwen2.5-Math-7B using PURE-PRM+VR, outperforming the verifiable-reward-only baseline (48.3%)
  • Reaches comparable reasoning performance to verifiable reward-based methods in only 30% of the training steps due to dense feedback
  • Stabilizes training significantly: sum-form methods collapse at step 25, while min-form methods remain stable for over 200 steps
Breakthrough Assessment
8/10
Identifies a fundamental misalignment in how PRMs are applied to RL (sum vs. min) and proposes a mathematically grounded, simple fix that prevents the widely reported issue of reward hacking in PRM training.
×