Let's reward step by step: Step-Level reward model as the Navigators for Reasoning

📝 Paper Summary

Multi-step Reasoning, Reward Modeling, Inference-time Search

HGS-PRM improves multi-step reasoning accuracy by using a process-supervised reward model to guide a heuristic greedy search during inference, validating intermediate steps before proceeding.

Core Problem

Large Language Models suffer from cascading errors in multi-step reasoning tasks, where a single incorrect step invalidates the entire subsequent reasoning chain.

Why it matters:

Current reasoning approaches like Chain of Thought (CoT) lack mechanisms to correct errors mid-generation, leading to error propagation
Existing search methods like BFS/DFS or self-reflection can be computationally prohibitive or get stuck in repetitive loops due to context window limits

Concrete Example: In a math problem requiring complex number calculation |(1-i)^8|, a standard model might make a calculation error in step 2 (e.g., calculating (1-i)^2 incorrectly). Without feedback, the model builds on this wrong intermediate value, inevitably leading to a wrong final answer (e.g., 32 instead of 16).
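For reference, the correct value in this example can be checked in three lines. This is a worked calculation added for illustration, not a derivation taken from the paper:

```latex
\begin{align*}
(1-i)^2 &= 1 - 2i + i^2 = -2i \\
(1-i)^8 &= \left((1-i)^2\right)^4 = (-2i)^4 = 16\,i^4 = 16 \\
\left|(1-i)^8\right| &= 16 \quad \text{(equivalently } |1-i|^8 = (\sqrt{2})^8 = 16\text{)}
\end{align*}
```

An error in the first line, the step-2 mistake described above, carries through both later lines, which is why checking each intermediate step matters.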

Key Novelty

Heuristic Greedy Search with Process-Supervised Reward Model (HGS-PRM)

Deploys a step-level reward model (PRM) as a navigator during *inference* (decoding) rather than just for training (RLHF), evaluating every generated step
Uses a greedy search algorithm that backtracks when the PRM detects a negative step, constraining the search space compared to exhaustive methods like BFS
Introduces an automated pipeline to generate step-level reward data for code using Abstract Syntax Tree (AST) mutation and unit testing

Architecture

The workflow of the Heuristic Greedy Search assisted by the Process-Supervised Reward Model (HGS-PRM)

Evaluation Highlights

+4.9 percentage points in pass@1 on HumanEval using Code-LLaMA-Python-7B compared to Chain of Thought (CoT)
+3.3 percentage points in accuracy on the MATH benchmark using WizardMath-13B compared to CoT
+2.2 percentage points in accuracy on the GSM8K benchmark using WizardMath-13B compared to CoT

Breakthrough Assessment

7/10

Solid application of PRMs to inference-time search with demonstrable gains. The automated generation of code PRM data via mutation testing is a clever, scalable contribution.

⚙️ Technical Details

Problem Definition

Setting: Multi-step reasoning tasks (Math and Code generation) where the solution path S can be decomposed into steps s_1...s_i

Inputs: Natural language question or problem statement x

Outputs: A complete reasoning path ending in the final answer or code solution

Pipeline Flow

Input Processing: Receive question x
Expansion: LLM generates candidate next step s_{i+1}
Evaluation: PRM scores s_{i+1} (Positive, Neutral, Negative)
Decision: If positive, accept; if negative, resample or backtrack; if neutral, conditional acceptance
Termination: Output result upon reaching end-of-sequence or max iterations
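The pipeline above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `hgs_prm`, `generate_step`, `score_step`, `is_final`, `num_candidates`, and `max_iters` are hypothetical names, and the rule for neutral steps (accept one only when no positive candidate exists) is one possible reading of "conditional acceptance".

```python
from typing import Callable, List

# Assumed label names; the PRM is described as a 3-class classifier.
POSITIVE, NEUTRAL, NEGATIVE = "positive", "neutral", "negative"


def hgs_prm(
    question: str,
    generate_step: Callable[[str, List[str]], str],    # hypothetical LLM wrapper
    score_step: Callable[[str, List[str], str], str],  # hypothetical PRM wrapper
    is_final: Callable[[str], bool],                   # detects end-of-sequence
    num_candidates: int = 3,
    max_iters: int = 50,
) -> List[str]:
    """Greedy step-by-step search guided by a step-level reward model."""
    path: List[str] = []
    for _ in range(max_iters):
        # Termination: stop once the last accepted step ends the solution.
        if path and is_final(path[-1]):
            break
        # Expansion: sample candidate next steps from the generator.
        candidates = [generate_step(question, path) for _ in range(num_candidates)]
        # Evaluation: label every candidate with the reward model.
        labels = [score_step(question, path, c) for c in candidates]
        # Decision: prefer a positive step, fall back to a neutral one (assumption).
        chosen = None
        for wanted in (POSITIVE, NEUTRAL):
            chosen = next((c for c, l in zip(candidates, labels) if l == wanted), None)
            if chosen is not None:
                break
        if chosen is not None:
            path.append(chosen)
        elif path:
            # All candidates negative: backtrack by dropping the last accepted step.
            path.pop()
        # With an empty path and only negative candidates, the next pass resamples.
    return path
```

Because each iteration keeps at most one step, the search explores a single path at a time, which is the sense in which it is cheaper than BFS-style expansion.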

System Modules

Generator

Generates the next potential reasoning step based on the current context path

Model or implementation: LLaMA-2 or WizardMath (Math), Code-LLaMA (Code)

Navigator (PRM)

Evaluates the correctness of the generated step

Model or implementation: LLaMA-7B fine-tuned as a classifier (3 classes: positive, neutral, negative)

Search Controller

Implements the Heuristic Greedy Search algorithm (Expand, Backup, Prune)

Model or implementation: Algorithmic Logic (HGS)

Novel Architectural Elements

Integration of a standalone PRM verifier inside a greedy search loop during inference
Backtracking mechanism triggered specifically by 'negative' PRM scores on all candidate sub-nodes

Modeling

Base Model: LLaMA-2-7B / 13B, WizardMath-7B / 13B, Code-LLaMA-Python-7B / 13B

Training Method: Supervised Fine-Tuning (SFT) followed by Reward Modeling

Objective Functions:

Purpose: Train the PRM to classify steps.

Formally: Standard classification loss over labels {Positive, Neutral, Negative}.
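One common form such a loss could take is per-step cross-entropy. The summary only says "standard classification loss", so the exact form below is an assumption, with x the question, s_1...s_i the steps so far, y_i the label of step s_i, and p_θ the PRM's predicted class distribution:

```latex
\mathcal{L}(\theta) = -\sum_{i} \log p_\theta\!\left(y_i \mid x, s_1, \dots, s_i\right),
\qquad y_i \in \{\text{Positive}, \text{Neutral}, \text{Negative}\}
```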

Training Data:

Math: PRM800K dataset (based on MATH)
Code: Generated from MBPP using AST mutation. Positive = Ground truth lines. Negative = Mutated lines that fail unit tests. Neutral = Mutated lines that pass unit tests.
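The code labeling rule can be illustrated with Python's standard `ast` module. This is a rough sketch under assumptions, not the paper's pipeline: the operator swaps in `SWAPS` are hypothetical (the actual mutation set is not listed here), `SwapNthOperator`, `passes_tests`, and `label_mutants` are illustrative names, and tests run with a bare `exec`, where a real pipeline would need sandboxing and timeouts. It requires Python 3.9+ for `ast.unparse`.

```python
import ast
from typing import Iterator, Tuple

# Hypothetical atomic-operator swaps used only for illustration.
SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add, ast.Mult: ast.Add,
         ast.Lt: ast.LtE, ast.Gt: ast.GtE}


class SwapNthOperator(ast.NodeTransformer):
    """Swap the n-th swappable operator (in visit order) and record its line."""

    def __init__(self, n: int):
        self.n = n
        self.count = 0
        self.lineno = None  # line of the mutated operator, if any

    def _swap(self, op: ast.AST, lineno: int) -> ast.AST:
        new_type = SWAPS.get(type(op))
        if new_type is None:
            return op
        hit = self.count == self.n
        self.count += 1
        if hit:
            self.lineno = lineno
            return new_type()
        return op

    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        self.generic_visit(node)
        node.op = self._swap(node.op, node.lineno)
        return node

    def visit_Compare(self, node: ast.Compare) -> ast.AST:
        self.generic_visit(node)
        node.ops = [self._swap(op, node.lineno) for op in node.ops]
        return node


def passes_tests(source: str, tests: str) -> bool:
    """Run assert-style tests against the source (unsandboxed; sketch only)."""
    env: dict = {}
    try:
        exec(source, env)
        exec(tests, env)
        return True
    except Exception:
        return False


def label_mutants(source: str, tests: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number, mutated source, label) for each single-operator mutant."""
    n = 0
    while True:
        mutator = SwapNthOperator(n)
        tree = mutator.visit(ast.parse(source))
        if mutator.lineno is None:  # no operator left to mutate
            return
        mutated = ast.unparse(tree)
        # Mutants that fail the tests are negative; surviving mutants are neutral.
        label = "neutral" if passes_tests(mutated, tests) else "negative"
        yield mutator.lineno, mutated, label
        n += 1
```

Lines of the unmodified ground-truth solution would carry the positive label; this sketch only produces the negative and neutral examples.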

Key Hyperparameters:

inference_temperature_math: 0.1
inference_top_p_math: 0.95
inference_temperature_code: 0.2
inference_top_p_code: 0.95

Compute: Not reported in the paper

Comparison to Prior Work

vs. CoT: HGS-PRM adds explicit step-level verification and backtracking
vs. ToT/RAP: HGS-PRM uses a specialized Reward Model (cheaper inference) instead of LLM self-reflection (expensive) and uses a greedy heuristic to prune the search space more aggressively

Limitations

Dependency on the quality of the PRM; if PRM accuracy is low, it can degrade performance (shown in filtering analysis)
Mutation testing for code data generation focuses primarily on atomic operators, limiting the generality of the code PRM
Computational cost is higher than simple CoT due to the search/backtracking process
Neutral label discrimination in PRM was found to be inadequate

Reproducibility

Paper states 'We have released a PRM dataset specifically for code', but no URL is found in the text. The PRM training methodology is described (AST mutation), and open-source base models (LLaMA, Code-LLaMA, WizardMath) are used.

📊 Experiments & Results

Evaluation Setup

Multi-step reasoning on Math and Code benchmarks

Benchmarks:

GSM8K (Grade school math word problems)
MATH (Challenging mathematics problems)
HumanEval (Python code generation)

Metrics:

Accuracy
pass@1
Statistical methodology: Not explicitly reported in the paper
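As an illustration of the pass@1 metric, the share of problems whose first generated solution passes the unit tests, a minimal sketch follows. The function name `pass_at_1` and its boolean-list input are assumptions for this example, not part of any benchmark harness.

```python
from typing import Sequence


def pass_at_1(first_sample_passed: Sequence[bool]) -> float:
    """Percentage of problems whose first generated solution passed its tests."""
    if not first_sample_passed:
        raise ValueError("need at least one problem")
    return 100.0 * sum(first_sample_passed) / len(first_sample_passed)
```

For scale, HumanEval contains 164 problems, so 60 passing first samples would round to 36.6% and 68 would round to 41.5%. Those counts are back-of-the-envelope arithmetic, not figures reported in the paper.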

Key Results

Math reasoning results showing improvements of HGS-PRM over the CoT baseline across standard and math-specialized models.
Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	63.2%	65.4%	+2.2%
GSM8K	Accuracy	31.7%	32.9%	+1.2%
MATH	Accuracy	10.4%	13.7%	+3.3%

Code generation results demonstrating that the mutation-based PRM data effectively guides code synthesis.
Benchmark	Metric	Baseline	This Paper	Δ
HumanEval	pass@1	36.6%	41.5%	+4.9%
HumanEval	pass@1	41.5%	44.5%	+3.0%

Experiment Figures

Precision, Recall, and Penalty Miss Rate for Math and Code PRMs

Main Takeaways

Integrating PRM into greedy search (HGS-PRM) consistently outperforms Chain of Thought (CoT) across both Math and Code tasks
Math-specific models (WizardMath) benefit more from the reward model than general models (LLaMA), suggesting alignment between generator and verifier capabilities is important
Automated PRM data generation via mutation testing is effective, yielding higher accuracy improvements in Code tasks compared to Math
Filtering sampled paths using PRM scores significantly increases accuracy (e.g., from 4.25% to 14.4% on MATH), validating that PRM scores correlate with correctness

📚 Prerequisite Knowledge

Prerequisites

Understanding of Chain of Thought (CoT) prompting
Familiarity with Reinforcement Learning from Human Feedback (RLHF) concepts
Basic knowledge of search algorithms (Greedy, BFS)

Key Terms

PRM: Process-Supervised Reward Model—a model trained to score individual steps of a reasoning process (positive/neutral/negative) rather than just the final outcome

HGS-PRM: Heuristic Greedy Search with PRM—the authors' proposed algorithm that uses PRM scores to decide whether to keep expanding a reasoning path or backtrack

CoT: Chain of Thought—a prompting technique that encourages LLMs to generate intermediate reasoning steps

AST: Abstract Syntax Tree—a tree representation of the abstract syntactic structure of source code, used here to programmatically mutate code for data generation

Mutation Testing: A software testing technique where 'mutants' (modified versions of code) are created to test the robustness of a test suite; used here to generate 'negative' code examples

pass@1: A metric for code generation measuring the percentage of problems where the first generated solution passes unit tests

RLHF: Reinforcement Learning from Human Feedback—training method to align models using reward models derived from human preferences

GSM8K: Grade School Math 8K—a dataset of grade school math word problems

MBPP: Mostly Basic Python Problems—a benchmark dataset for code generation

SFT: Supervised Fine-Tuning—training a pre-trained model on specific task data (e.g., math instructions)