Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

📝 Paper Summary

Reinforcement Learning for Reasoning
Process Reward Models (PRMs)

PURE replaces the standard summation of future rewards in reinforcement learning with a minimum-reward formulation, aligning training with test-time usage to prevent process reward hacking in reasoning tasks.

Core Problem

Standard reinforcement learning sums discounted future rewards, which induces models to 'hack' Process Reward Models (PRMs) by generating long sequences of high-reward 'thinking' steps that fail to solve the actual problem.

Why it matters:

Process Reward Models (PRMs) are effective for test-time scaling but difficult to use for training due to reward hacking issues
Canonical summation-form credit assignment causes training collapse, where the model optimizes for high interim scores rather than correct final answers
Sparse verifiable rewards (checking only the final answer) are inefficient for long-context reasoning tasks

Concrete Example: With summation-form credit assignment, the model learns to output only 'thinking' steps (which receive high process rewards) without ever generating a solution, so training collapses quickly (e.g., benchmark scores drop to ~30 at step 80).

Key Novelty

PURE (Process sUpervised Reinforcement lEarning)

Redefines the value of a reasoning trajectory as the *minimum* future process reward rather than the *sum* of discounted future rewards, aligning training with how PRMs are used during inference (Best-of-N)
Limits the value function's range to the reward model's range, preventing the accumulation of errors that occurs when summing multiple step rewards
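
To make the range argument concrete, here is a short worked bound. It assumes per-step process rewards r_k in [0, 1], no discounting, and a trajectory of n steps; these normalizations are assumptions chosen for illustration and are not taken from the paper.

```latex
\begin{align*}
G_t^{\text{sum}} &= \sum_{k=t}^{n} r_k \;\in\; [0,\; n - t + 1], \\
G_t^{\text{min}} &= \min_{t \le k \le n} r_k \;\in\; [0,\; 1].
\end{align*}
```

Under these assumptions the sum-form return can grow with the number of remaining steps, so appending more high-scoring steps raises it. The min-form return stays inside the reward model's own range and can never be raised by adding steps.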

Architecture

Visualization of sampling probability changes during training for both Sum-form and Min-form credit assignment on a specific math problem

Evaluation Highlights

Achieves 53.3% average accuracy across 5 math benchmarks with Qwen2.5-Math-7B using PURE-PRM+VR, outperforming the verifiable-reward-only baseline (48.3%)
Reaches comparable reasoning performance to verifiable reward-based methods in only 30% of the training steps due to dense feedback
Stabilizes training significantly: sum-form methods collapse at step 25, while min-form methods remain stable for over 200 steps

Breakthrough Assessment

8/10

Identifies a fundamental misalignment in how PRMs are applied to RL (sum vs. min) and proposes a mathematically grounded, simple fix that prevents the widely reported issue of reward hacking in PRM training.

⚙️ Technical Details

Problem Definition

Setting: Step-level Markov Decision Process for LLM reasoning

Inputs: Prompt p

Outputs: Sequence of reasoning steps a_1, ..., a_n

Pipeline Flow

Policy Model (generates response steps)
Process Reward Model (scores each step)
Credit Assignment (calculates advantages via Min-Form)
RL Update (updates Policy)

System Modules

Policy Model

Generates reasoning steps sequentially based on the prompt

Model or implementation: Qwen2.5-Math-7B (and variants)

Process Reward Model (PRM)

Assigns a scalar score to each generated step

Model or implementation: PURE-PRM-7B (trained on PRM800K)

Credit Assignment Module

Computes returns using Min-Form aggregation instead of summation

Model or implementation: Mathematical Function

Novel Architectural Elements

Min-form credit assignment function: Value of a state is defined by the minimum reward of any future step in the trajectory, rather than the discounted sum

Modeling

Base Model: Qwen2.5-Math-7B

Training Method: Process sUpervised Reinforcement lEarning (PURE) using the RLOO estimator
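
As a rough illustration of the RLOO estimator named above, the sketch below computes leave-one-out advantages for a group of responses sampled from the same prompt. It assumes one scalar return per response; the function name `rloo_advantages` and the example values are hypothetical, and how PURE applies the estimator at the step or token level is not shown here.

```python
from typing import List


def rloo_advantages(returns: List[float]) -> List[float]:
    """Leave-one-out advantages for k responses to the same prompt.

    Each response's baseline is the mean return of the other k - 1 responses.
    """
    k = len(returns)
    if k < 2:
        raise ValueError("RLOO needs at least two responses per prompt")
    total = sum(returns)
    return [r - (total - r) / (k - 1) for r in returns]


# Hypothetical returns for three sampled responses to one prompt.
advantages = rloo_advantages([1.0, 0.0, 0.5])
print(advantages)
```

Because the baseline for each response excludes that response's own return, the advantages in a group sum to zero, and no learned value network is needed.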

Objective Functions:

Purpose: Calculate the return of a trajectory based on the worst step.

Formally: G_t = min(r*_t, ..., r*_n) where r* is a transformed reward.

Purpose: Transform process rewards to emphasize differences.

Formally: r* = -log(r) / T
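
The following Python sketch contrasts the two return definitions. It assumes per-step rewards where higher is better and applies each aggregation directly to the values it is given; it does not implement the reward transform, and the function names and example reward values are hypothetical.

```python
from typing import List


def sum_form_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of future rewards: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def min_form_returns(rewards: List[float]) -> List[float]:
    """Minimum of future rewards: G_t = min(r_t, ..., r_n)."""
    returns = [0.0] * len(rewards)
    running = float("inf")
    for t in reversed(range(len(rewards))):
        running = min(running, rewards[t])
        returns[t] = running
    return returns


# Hypothetical step rewards: a short trajectory with uniformly decent steps,
# and a longer one padded with high-scoring steps that ends in a poor step.
short = [0.75, 0.5, 0.75]
padded = [0.75, 0.75, 0.75, 0.75, 0.75, 0.25]

print(sum_form_returns(short)[0], sum_form_returns(padded)[0])
print(min_form_returns(short)[0], min_form_returns(padded)[0])
```

In this toy case the sum-form return at the first step is larger for the padded trajectory, while the min-form return is larger for the short one, which is the behavior the min-form definition is meant to encourage.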

Adaptation: Full fine-tuning

Training Data:

SimpleRL dataset (subset of MATH)
800 problems with Ground Truth (GT) answers
7,200 open problems (no GT) for PRM-only signal

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 512
kl_coefficient: 1e-3
transform_temperature_T: 0.1
responses_per_prompt: 8
max_generation_length: 8192

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek R1-Zero: PURE uses dense process rewards with min-form aggregation, requiring fewer steps and less ground truth data
vs. Eurus-2: PURE uses explicit PRM scores with a specific credit assignment fix (min-form) rather than implicit rewards
vs. Standard RL (PPO): PURE changes the return calculation from the discounted sum of future rewards to the minimum of future rewards, aligning it with PRM inference objectives

Limitations

Reward hacking is delayed but still inevitable when relying solely on PRMs (requires mixing in ~10% verifiable rewards for best stability)
Depends on the quality of the trained PRM (though PURE-PRM-7B is released)
Min-form assumption (worst step determines value) is heuristic-based, derived from Best-of-N inference logic

Reproducibility

Code: https://github.com/CJReinforce/PURE

Code and model weights are publicly available in the repository above. The dataset is derived from SimpleRL (open source).

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks evaluated using Pass@1 accuracy

Benchmarks:

AIME24 (Competition Math)
AMC23 (Competition Math)
MATH500 (Mathematical Problem Solving)
Minerva Math (Mathematical Problem Solving)
OlympiadBench (Olympiad-level Math)

Metrics:

Pass@1 Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of PURE variants against verifiable-reward (VR) baselines on Qwen2.5-Math-7B.

Benchmark	Metric	Baseline	This Paper	Δ
Average (5 Benchmarks)	Accuracy (%)	48.3	53.3	+5.0
Average (5 Benchmarks)	Accuracy (%)	48.0	53.3	+5.3
AMC23	Accuracy (%)	73.8	82.5	+8.7

Ablation showing performance of PRM-only methods compared to VR-only methods.

Benchmark	Metric	Baseline	This Paper	Δ
Average (5 Benchmarks)	Accuracy (%)	48.3	49.3	+1.0

Experiment Figures

Training curves comparing Sum-form vs. Min-form stability and efficiency

Main Takeaways

Min-form credit assignment is essential for PRM-based training; sum-form causes immediate collapse (accuracy drops below that of the base model by step 80)
Combining dense process rewards with sparse verifiable rewards (PURE-PRM+VR) yields the best performance (53.3% vs. 48.3% average accuracy, +5.0 points over VR only)
PRM-based training is significantly more sample-efficient, reaching VR-baseline performance in ~30% of steps
Even with Min-form, pure PRM training eventually encounters reward hacking (verifiable rewards drop while process rewards rise), but mixing in verifiable rewards for the ~10% of problems that have ground-truth answers mitigates this

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/Policy Gradient)
Large Language Models (LLMs)
Markov Decision Processes (MDP)

Key Terms

PRM: Process Reward Model—a model that evaluates the correctness of each intermediate step in a reasoning chain, providing dense feedback

Credit Assignment: The problem of determining which past actions are responsible for a final outcome or reward

RFT: Reinforcement Fine-Tuning—improving a pre-trained model using reinforcement learning algorithms like PPO or GRPO

Verifiable Reward: A sparse reward signal (usually 0 or 1) given only at the end of generation based on whether the final answer matches the ground truth

RLOO: REINFORCE Leave-One-Out—a policy gradient estimator that uses the average reward of other samples as a baseline to reduce variance

DPO: Direct Preference Optimization—a method to align models using preference pairs without an explicit reward model loop

Reward Hacking: When an agent exploits flaws in the reward function to maximize points without achieving the intended goal (e.g., generating gibberish that the reward model likes)

Best-of-N: An inference strategy where the model generates N solutions and a reward model selects the best one
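
A minimal sketch of Best-of-N selection with a PRM is given below. It assumes each candidate is scored by its worst step (the aggregation the min-form return is described as mirroring) and that per-step scores are already available; the function name `best_of_n` and the example scores are hypothetical.

```python
from typing import Sequence


def best_of_n(step_scores: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate whose lowest step score is highest.

    step_scores[i] holds the per-step PRM scores of candidate i.
    Every candidate is assumed to contain at least one step.
    """
    if not step_scores:
        raise ValueError("need at least one candidate")
    return max(range(len(step_scores)), key=lambda i: min(step_scores[i]))


# Hypothetical PRM scores for three candidate solutions.
candidates = [
    [0.75, 0.25, 0.75],  # one weak middle step
    [0.5, 0.5],          # uniformly moderate
    [1.0, 0.125],        # strong start, very weak final step
]
print(best_of_n(candidates))
```

Here the uniformly moderate candidate is selected, since a single very low step score disqualifies the other two regardless of how high their remaining steps score.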