GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

📝 Paper Summary

Process Reward Models (PRMs), Test-Time Scaling (TTS), Mathematical Reasoning

GenPRM recasts process supervision as a generative reasoning task: the verifier writes verification-specific Chain-of-Thought and executes code before judging a step. This lets the verifier scale its test-time compute and allows smaller models to outperform larger discriminative baselines.

Core Problem

Existing Process Reward Models are trained as discriminative classifiers that output scalar scores, preventing them from leveraging the generative reasoning capabilities of LLMs or scaling compute at test time.

Why it matters:

Discriminative PRMs have limited process supervision and generalization capabilities compared to generative models.
Scalar prediction ignores the potential of 'thinking' (reasoning) about why a step is correct or incorrect.
Current verifiers cannot improve their judgment quality by spending more inference time (test-time scaling), unlike reasoning policies (e.g., o1).

Concrete Example: In a complex math problem, a standard PRM might assign a score of 0.8 to a step containing a subtle calculation error because it looks superficially correct. A generative PRM would attempt to write code to verify that specific calculation, execute it, find the discrepancy, and output a 'No' judgment.

Key Novelty

Generative Process Reward Model (GenPRM)

Transforms verification from classification (scalar output) to generation: the model produces Chain-of-Thought reasoning and Python code to verify a step before judging it.
Uses Relative Progress Estimation (RPE) to label training data: a step's Monte Carlo success rate is compared against that of the previous state, and the step is labeled correct when this relative progress clears a threshold (epsilon = 0.8).
Enables Test-Time Scaling for the verifier itself: by sampling multiple reasoning/code-verification paths and voting, the verifier's accuracy improves with more compute.

Architecture

The overall framework of GenPRM including data synthesis, training, and test-time scaling.

Evaluation Highlights

GenPRM-7B with majority voting (Maj@8) achieves 80.5% F1 on ProcessBench, outperforming the much larger Qwen2.5-Math-PRM-72B (78.3%).
GenPRM-1.5B (Maj@8) reaches 63.4% on ProcessBench, surpassing the proprietary GPT-4o (61.9%).
As a critic model, GenPRM-7B improves policy performance on MATH to 55.4% (Turn 3), 3.7 points above DeepSeek-R1-Distill-7B (51.7%).

Breakthrough Assessment

9/10

Significant paradigm shift from discriminative to generative verification. Demonstrates that scaling verification compute lets small models beat models roughly 10x larger (7B vs. 72B) as well as GPT-4o on ProcessBench. Highly relevant to current 'reasoning model' trends.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process where a verifier estimates the correctness of a reasoning step a_t given state s_t.

Inputs: Current problem context and reasoning history (s_t, a_t).

Outputs: A generated rationale v_t (containing CoT and code), execution feedback f_t, and a final correctness token (Yes/No).
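The inputs and outputs above can be written compactly as follows. This is a sketch, not the paper's exact notation: the policy symbol \pi_\theta, the executor \mathrm{exec}, the score \hat{r}_t, and the cutoff \tau are names assumed here for illustration.

```latex
\begin{aligned}
v_t &\sim \pi_\theta(\,\cdot \mid s_t, a_t) && \text{(rationale: CoT and code)} \\
f_t &= \mathrm{exec}(v_t) && \text{(execution feedback)} \\
\hat{r}_t &= \pi_\theta(\text{Yes} \mid s_t, a_t, v_t, f_t) && \text{(probability of the Yes token)} \\
\text{judgment}_t &= \begin{cases} \text{Yes} & \text{if } \hat{r}_t \ge \tau \\ \text{No} & \text{otherwise} \end{cases}
\end{aligned}
```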

Pipeline Flow

Input State (Problem + History + Current Step)
Generative Verification (CoT + Code Generation)
Code Execution (Run code, get feedback)
Final Judgment (Predict Yes/No token based on rationale and feedback)
Aggregation (Optional: Majority vote over N verification paths)
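The flow above can be sketched in Python as a minimal illustration. The `generate` and `judge` callables are hypothetical stand-ins for GenPRM model calls, `run_code` is an unsandboxed toy executor, and breaking ties toward "No" is an assumption rather than something the summary specifies.

```python
import contextlib
import io
from collections import Counter
from typing import Callable, Tuple

# Hypothetical signatures standing in for calls to the verifier model.
Generator = Callable[[str, str], Tuple[str, str]]  # (state, step) -> (cot, code)
Judge = Callable[[str, str, str, str], str]        # (state, step, cot, feedback) -> "Yes"/"No"


def run_code(code: str) -> str:
    """Run verification code and return its stdout, or the error text.

    Illustration only: exec() is not sandboxed here.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()


def verify_step(state: str, step: str, generate: Generator, judge: Judge,
                n_paths: int = 8) -> bool:
    """Sample n_paths verification paths and majority-vote the judgments."""
    votes = []
    for _ in range(n_paths):
        cot, code = generate(state, step)    # generative verification
        feedback = run_code(code)            # code execution
        votes.append(judge(state, step, cot, feedback))  # final judgment
    counts = Counter(votes)
    # Assumption: a tie counts as "No".
    return counts["Yes"] > counts["No"]
```

With `n_paths=1` this reduces to a single verification pass; larger values trade extra inference compute for a more stable judgment.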

System Modules

Rationale Generator (Generative Verification)

Generate natural language analysis and Python verification code for the current step.

Model or implementation: GenPRM (based on DeepSeek-R1-Distill-Qwen)

Code Executor (Generative Verification)

Execute the generated verification code and return output.

Model or implementation: Python Interpreter

Judgment Head

Predict the final reward token (Yes/No) given the rationale and execution result.

Model or implementation: GenPRM (same model as generator)

Novel Architectural Elements

Integration of explicit code generation and execution loop *inside* the reward model's inference pass.
Generative verification head replacing the standard scalar regression head.

Modeling

Base Model: DeepSeek-R1-Distill-Qwen-1.5B, 7B, and 32B

Training Method: Supervised Fine-Tuning (SFT)

Adaptation: Full fine-tuning

Trainable Parameters: All parameters

Training Data:

Source: MATH dataset (7.5K problems).
Step 1: Generate solutions using Qwen2.5-7B-Instruct with step forcing.
Step 2: Monte Carlo (MC) estimation for step correctness (K=32 to 128 rollouts).
Step 3: Relative Progress Estimation (RPE) to determine labels (Threshold epsilon=0.8).
Step 4: Rationale Synthesis using QwQ-32B (CoT + Code), filtered by consensus.
Final Dataset: 23K training examples with reasoning steps and rationales.
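Steps 2 and 3 can be illustrated with a short sketch. It assumes relative progress is the ratio of the Monte Carlo success rate after the step to the rate before it. The summary only says the two are compared, so the ratio form, the handling of zero rates, and the function names are assumptions.

```python
from typing import Callable


def mc_estimate(rollout_is_correct: Callable[[], bool], k: int = 32) -> float:
    """Estimate the chance of reaching the right answer from k rollouts."""
    return sum(1 for _ in range(k) if rollout_is_correct()) / k


def rpe_label(mc_prev: float, mc_curr: float, epsilon: float = 0.8) -> int:
    """Label a step 1 (correct) or 0 (incorrect) from relative progress.

    mc_prev: MC success rate from the state before the step.
    mc_curr: MC success rate from the state after the step.
    """
    if mc_curr == 0.0:
        return 0  # assumption: no rollout succeeds after the step
    if mc_prev == 0.0:
        return 1  # assumption: any success from a zero baseline is progress
    return 1 if mc_curr / mc_prev >= epsilon else 0
```

For example, a step that moves the success rate from 0.5 to 0.45 keeps a ratio of 0.9 and is labeled correct, while a drop from 0.5 to 0.25 (ratio 0.5) is labeled incorrect.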

Key Hyperparameters:

batch_size: 64
learning_rate: 2.0e-6
temperature: 0.6 (evaluation)
epsilon: 0.8 (RPE threshold)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Math-Shepherd/Skywork: GenPRM is generative, outputting text/code before judgment, whereas baselines are discriminative classifiers.
vs. Direct Generative PRM: GenPRM uses explicit CoT and Code Verification, while Direct Generative PRM only predicts Yes/No directly.
vs. DeepSeek-R1 [not cited in paper]: DeepSeek-R1 uses RL to incentivize reasoning for *solving*; GenPRM uses SFT on synthetic data to incentivize reasoning for *verification*.

Limitations

Generative reasoning introduces additional inference computation compared to scalar PRMs.
Effectiveness depends on the quality of the rationale synthesis model (QwQ-32B) used for training data creation.
Evaluation is primarily focused on mathematical reasoning tasks.
Relies on a fixed threshold (0.5) for binary judgment during test-time scaling.

Reproducibility

Code: https://ryanliu112.github.io/GenPRM

Code, model, and data are available at https://ryanliu112.github.io/GenPRM. The paper details the data synthesis pipeline (using Qwen2.5 and QwQ-32B) and the specific thresholding logic (RPE epsilon=0.8). Training used 23K data points.

📊 Experiments & Results

Evaluation Setup

Process supervision evaluation on math benchmarks.

Benchmarks:

ProcessBench (Step-level verification)
MATH (Mathematical Problem Solving)
AMC23/AIME24/Minerva Math (Mathematical Problem Solving)

Metrics:

Pass@1 (Accuracy)
Maj@N (Majority Vote Accuracy)
F1 Score (ProcessBench)
Best-of-N Accuracy
Statistical methodology: Not explicitly reported in the paper
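Two of these metrics can be sketched as follows. The F1 here follows the ProcessBench convention of taking the harmonic mean of accuracy on erroneous samples and accuracy on correct samples. Scoring a Best-of-N candidate by its minimum step reward is an assumption made for illustration; the summary does not state how step scores are aggregated.

```python
from typing import Sequence


def harmonic_f1(acc_erroneous: float, acc_correct: float) -> float:
    """Harmonic mean of accuracy on erroneous and on correct samples."""
    total = acc_erroneous + acc_correct
    if total == 0:
        return 0.0
    return 2 * acc_erroneous * acc_correct / total


def best_of_n(candidates: Sequence[str],
              step_scores: Sequence[Sequence[float]]) -> str:
    """Pick the candidate whose weakest step has the highest reward.

    Assumption: a solution's score is the minimum of its step rewards.
    """
    solution_scores = [min(scores) for scores in step_scores]
    best_index = max(range(len(candidates)), key=lambda i: solution_scores[i])
    return candidates[best_index]
```

The harmonic mean penalizes a verifier that is strong on only one class, so a model that labels every step as correct cannot score well.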

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ProcessBench results demonstrate GenPRM's superior verification capability, where small GenPRM models using test-time scaling outperform significantly larger baselines.
ProcessBench	Avg F1 Score	78.3	80.5	+2.2
ProcessBench	Avg F1 Score	61.9	63.4	+1.5
ProcessBench	Avg F1 Score	31.5	75.2	+43.7
Critic Refinement results show GenPRM's ability to improve policy model outputs through iterative feedback.
MATH	Accuracy (Turn 3)	51.7	55.4	+3.7
AIME24	Accuracy (Turn 3)	18.8	22.8	+4.0
Ablation study on reasoning components confirms that both CoT and Code Verification contribute to performance.
ProcessBench	Avg F1 (Maj@8)	79.9	80.5	+0.6
ProcessBench	Avg F1 (Maj@8)	60.0	80.5	+20.5

Experiment Figures

Comparison of GenPRM against baselines on ProcessBench (left) and as a critic on MATH (right).

Best-of-N scaling results on multiple benchmarks (MATH, AMC23, AIME24, Minerva).

Main Takeaways

Test-time scaling of the verifier (GenPRM) is highly effective: a 1.5B model can outperform GPT-4o, and a 7B model can outperform a 72B model by simply reasoning more (Maj@8).
Relative Progress Estimation (RPE) with a high threshold (0.8) yields better training labels than standard hard MC estimation or lower thresholds.
The combination of natural language CoT and Code Verification provides the strongest supervision: in the ablation, the full model reaches 80.5 Avg F1 (Maj@8) on ProcessBench, against 79.9 and 60.0 for the reduced-component variants.
GenPRM generalizes well as a critic model, improving the performance of different policy models (Qwen2.5, Gemma-3) across multiple turns of refinement.
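The critic role in the last takeaway can be pictured as a refinement loop. This is a hypothetical sketch: the `policy` and `critic` callables, their signatures, and the convention that the critic returns `None` when it finds no faulty step are assumptions, as is counting each revision as one turn.

```python
from typing import Callable, Optional

# Hypothetical interfaces: the policy drafts or revises a solution,
# and the critic returns feedback text or None if it accepts the solution.
Policy = Callable[[str, Optional[str], Optional[str]], str]
Critic = Callable[[str, str], Optional[str]]


def refine(problem: str, policy: Policy, critic: Critic,
           max_turns: int = 3) -> str:
    """Revise a solution with critic feedback for up to max_turns turns."""
    solution = policy(problem, None, None)  # initial draft
    for _ in range(max_turns):
        critique = critic(problem, solution)
        if critique is None:                # critic found nothing to fix
            break
        solution = policy(problem, solution, critique)
    return solution
```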

📚 Prerequisite Knowledge

Prerequisites

Process Reward Models (PRMs)
Chain-of-Thought (CoT)
Monte Carlo (MC) Estimation
Supervised Fine-Tuning (SFT)
Test-Time Scaling (TTS)

Key Terms

PRM: Process Reward Model—a model that evaluates the correctness of intermediate steps in a reasoning chain.

GenPRM: Generative Process Reward Model—the authors' proposed method that reasons and writes code to verify steps.

RPE: Relative Progress Estimation—a labeling method that compares the Monte Carlo success rate of the current step against the previous step to determine correctness.

Test-Time Scaling: Improving model performance during inference by increasing computational cost (e.g., generating multiple samples and voting).

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps.

SFT: Supervised Fine-Tuning—training a model on labeled examples.

Pass@1: The accuracy when generating a single solution.

Maj@N: Majority voting accuracy over N generated solutions/paths.

Critic: A model role where the LLM provides feedback to refine a generated solution rather than just scoring it.