PRM: Process Reward Model—a model that assigns a score to each intermediate step of a reasoning chain rather than just the final outcome
ORM: Outcome Reward Model—a model that scores only the final answer of a reasoning chain
KL-regularization: A constraint used in RL to prevent the new policy from deviating too far from the reference (initial) policy, ensuring stability and retaining prior knowledge
Math-Shepherd: A baseline method for automatically labeling process rewards by estimating the probability of reaching a correct answer from a given step
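The Math-Shepherd labeling idea can be sketched as a Monte Carlo estimate: from a given intermediate step, sample several completions and use the fraction that reach the correct final answer as that step's process label. A minimal sketch, assuming a hypothetical `sample_completion` callable (everything here is illustrative, not the paper's implementation):

```python
import random

def estimate_step_label(sample_completion, step_prefix, correct_answer, n_rollouts=8):
    """Math-Shepherd-style label: estimate P(correct final answer | reasoning prefix)
    by sampling n_rollouts completions from the given step and counting successes."""
    hits = sum(
        sample_completion(step_prefix) == correct_answer
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts

# Toy completion sampler (hypothetical): succeeds with a prefix-dependent probability.
def toy_sampler(prefix):
    p_correct = 0.9 if "good step" in prefix else 0.2
    return "42" if random.random() < p_correct else "wrong"

random.seed(0)
label = estimate_step_label(toy_sampler, "good step", "42", n_rollouts=100)
print(round(label, 2))  # typically close to 0.9
```

In practice the rollouts come from the policy model itself, which is what makes the labels "automatic" but also noisy and expensive to collect.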
Best-of-N: An inference-time strategy where N solutions are generated, and the one with the highest reward score is selected
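Best-of-N selection itself is a one-liner once a scoring function exists; a minimal sketch with hypothetical `generate` and `reward_model` callables:

```python
def best_of_n(generate, reward_model, prompt, n=8):
    """Generate n candidate solutions and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward_model)

# Toy stand-ins (hypothetical): candidates are strings, reward is their length.
solutions = iter(["a", "abc", "ab"])
pick = best_of_n(lambda p: next(solutions), len, "prompt", n=3)
print(pick)  # "abc"
```

With a PRM, `reward_model` would aggregate per-step scores (e.g. via the soft-max aggregation defined below) into a single scalar per candidate.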
Rejection Sampling: A training strategy where samples are generated, filtered/ranked by a reward model, and the best ones are used to fine-tune the policy
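The rejection-sampling loop reduces to: sample, rank by reward, keep the top few as supervised fine-tuning pairs. A hedged sketch with hypothetical `generate` and `reward_model` callables:

```python
import itertools

def rejection_sampling_dataset(prompts, generate, reward_model, n=8, keep=1):
    """For each prompt, sample n candidates, rank them by reward, and keep the
    top `keep` as (prompt, response) pairs for supervised fine-tuning."""
    dataset = []
    for prompt in prompts:
        candidates = sorted(
            (generate(prompt) for _ in range(n)),
            key=reward_model,
            reverse=True,
        )
        dataset.extend((prompt, c) for c in candidates[:keep])
    return dataset

# Toy example (hypothetical reward = numeric value of the sampled string).
samples = itertools.cycle(["3", "7", "5"])
data = rejection_sampling_dataset(["q1"], lambda p: next(samples), int, n=3, keep=1)
print(data)  # [('q1', '7')]
```

Unlike Best-of-N, which spends the reward model at inference time, this spends it at training time: the filtered pairs are fed back into fine-tuning the policy.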
CoT: Chain-of-Thought—a prompting strategy that encourages the model to generate intermediate reasoning steps
Soft-max: In this context, a specific aggregation function for per-step rewards derived from the entropy-regularized objective: log(E[exp(reward)]), a log-mean-exp that interpolates between the mean and the maximum of the step rewards
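The soft-max aggregation is easy to compute directly; the only subtlety is numerical stability, handled by subtracting the maximum before exponentiating. A minimal sketch:

```python
import math

def softmax_aggregate(step_rewards):
    """Aggregate per-step rewards as log(E[exp(r)]) (log-mean-exp),
    computed stably by factoring out the max before exponentiating."""
    m = max(step_rewards)
    return m + math.log(sum(math.exp(r - m) for r in step_rewards) / len(step_rewards))

# Log-mean-exp sits between the mean (1.0) and the max (2.0) of the step rewards.
rewards = [0.0, 1.0, 2.0]
agg = softmax_aggregate(rewards)
print(round(agg, 3))
```

Because it soft-approximates the maximum, this aggregation weights the highest-scoring steps more heavily than a plain average would.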
RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using reward signals derived from human preferences or, in this case, correctness checks