Value-Guided Search for Efficient Chain-of-Thought Reasoning

📝 Paper Summary

Test-Time Compute (TTC) Scaling, Reasoning Models, Reward Modeling

VGS scales test-time compute by training a token-level value model on outcome-labeled traces—bypassing expensive step-level annotations—and guiding block-wise search with weighted majority voting.

Core Problem

Scaling test-time compute for reasoning models is hindered because defining "steps" for Process Reward Models (PRMs) is ambiguous in long contexts, and collecting step-wise labels is prohibitively expensive.

Why it matters:

State-of-the-art reasoning models like DeepSeek-R1 require massive inference compute due to long Chain-of-Thought (CoT) traces
Existing PRM methods struggle to scale because they rely on human or LLM-judge annotations for every step, which is costly and hard to define for continuous reasoning streams
Models often get stuck in unproductive loops or generate repetitive content without granular guidance

Concrete Example: When a reasoning model generates a long solution, standard PRMs look for newlines to define 'steps' to score. If the model outputs a 4000-token block of dense reasoning without clear delimiters, step-based PRMs fail to provide feedback. VGS avoids this by scoring arbitrary token blocks using a value model trained on final outcomes.

Key Novelty

Value-Guided Search (VGS)

Trains a token-level value model using 'regression via classification' on full reasoning traces, predicting whether a partial trace will lead to a correct answer without needing intermediate step labels
Performs block-wise beam search where the value model scores fixed-length token blocks (e.g., 4096 tokens), selecting the most promising paths to continue generation
Aggregates final search results using Weighted Majority Voting (WMV) rather than Best-of-N, leveraging value scores to weight consensus

Architecture

Data collection pipeline (Left) and Beam Search process (Right).

Evaluation Highlights

VGS on DeepSeek-R1-Distill-Qwen-14B (15.5B total parameters: the 14B generator plus the 1.5B value model) matches the performance of the 671B DeepSeek-R1 on AIME/HMMT benchmarks with a budget of 64 generations
VGS reduces average response length by over 12% (11,219 tokens vs 12,793) compared to the base DeepSeek-R1-Distill-Qwen-1.5B model while improving accuracy
DeepSeek-VM-1.5B outperforms 7B baseline PRMs (Math-Shepherd and Qwen2.5-Math) when used for weighted majority voting or search guidance

Breakthrough Assessment

8/10

Offers a highly practical recipe for scaling test-time compute without expensive human or LLM step-level labels. With search, a ~15B-parameter system matches the performance of the 671B DeepSeek-R1. The open-sourced dataset of 2.5M traces is a significant resource.

⚙️ Technical Details

Problem Definition

Setting: Math reasoning tasks where a model generates a solution $y$ for prompt $x$, evaluated by a binary reward $r$.

Inputs: Math problem prompt $x$ (evaluation prompts come from AIME/HMMT)

Outputs: Final answer (parsed from boxed text in the generated solution)

Pipeline Flow

Generator (DeepSeek) samples parallel blocks
Value Model scores blocks
Beam Search selects top-k blocks
Repeat until completion
Weighted Majority Vote aggregates final answers
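The generate, score, and prune loop above can be illustrated with a minimal sketch. Everything in it is a toy assumption rather than the paper's implementation: `toy_generate_block` stands in for the generator, `toy_value` stands in for the value model, a "block" is a single character instead of thousands of tokens, and the names `num_beams` and `samples_per_beam` are hypothetical. The final aggregation step is left out to keep the sketch short.

```python
import random

MAX_BLOCKS = 4  # toy assumption: every trace finishes after 4 blocks


def toy_generate_block(prefix, rng):
    """Hypothetical generator stand-in: append one block (here, one digit)."""
    trace = prefix + rng.choice("0123456789")
    done = len(trace) >= MAX_BLOCKS
    return trace, done


def toy_value(trace):
    """Hypothetical value-model stand-in: a score in [0, 1] for a partial trace."""
    return sum(int(c) for c in trace) / (9 * len(trace))


def block_beam_search(num_beams, samples_per_beam, seed=0):
    rng = random.Random(seed)
    beams = [""] * num_beams  # all beams start from the empty trace
    finished = []
    while beams:
        # 1. Sample several candidate blocks from every live beam.
        candidates = [
            toy_generate_block(prefix, rng)
            for prefix in beams
            for _ in range(samples_per_beam)
        ]
        # 2. Score each extended trace and keep the highest-value ones.
        candidates.sort(key=lambda c: toy_value(c[0]), reverse=True)
        kept = candidates[:num_beams]
        # 3. Completed traces leave the beam; the rest are extended again.
        finished.extend(trace for trace, done in kept if done)
        beams = [trace for trace, done in kept if not done]
    return [(trace, toy_value(trace)) for trace in finished]


results = block_beam_search(num_beams=4, samples_per_beam=2)
for trace, value in results:
    print(trace, round(value, 3))
```

The design point the sketch tries to capture is that scoring happens once per block, not once per semantic step, so no step delimiter is ever needed.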

System Modules

Generator (Search & Generation)

Generate candidate reasoning blocks (sequences of tokens)

Model or implementation: DeepSeek-R1-Distill-Qwen (1.5B, 7B, or 14B)

Value Model

Predict the probability that a partial trace leads to a correct answer

Model or implementation: DeepSeek-VM-1.5B (Classifier)

Beam Selector (Search & Generation)

Prune the search space by keeping only the highest-value partial traces

Model or implementation: Algorithm (Beam Search)

Aggregator

Select the final answer from completed paths

Model or implementation: Weighted Majority Vote (WMV)
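To make the difference between the two aggregation rules concrete, here is a small sketch that assumes each completed path has been reduced to an (answer, value score) pair. The function names and the example scores are hypothetical and chosen only to show a case where the two rules disagree.

```python
from collections import defaultdict


def weighted_majority_vote(scored_answers):
    """Sum the value scores per distinct answer and return the heaviest answer."""
    totals = defaultdict(float)
    for answer, value in scored_answers:
        totals[answer] += value
    return max(totals, key=totals.get)


def best_of_n(scored_answers):
    """Return the answer of the single highest-scoring path."""
    return max(scored_answers, key=lambda pair: pair[1])[0]


# Three moderately scored paths agree on "42"; one confident outlier says "17".
scored = [("42", 0.6), ("42", 0.55), ("42", 0.5), ("17", 0.9)]

print(weighted_majority_vote(scored))  # 42 (total weight 1.65 vs 0.9)
print(best_of_n(scored))               # 17
```

Best-of-N trusts one score completely, so a single over-confident path decides the outcome, whereas the weighted vote lets agreement among paths offset it.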

Novel Architectural Elements

Block-level guidance mechanism: Value model scores fixed-token-length blocks rather than semantic steps (newlines)
Integration of Weighted Majority Voting (WMV) as the final aggregation step for Beam Search (replacing standard max-score selection)

Modeling

Base Model: DeepSeek-R1-Distill-Qwen-1.5B (used as architecture for Value Model)

Training Method: Regression via Classification (Supervised Learning)

Objective Functions:

Purpose: Train classifier to predict final outcome given partial trace.

Formally: Cross-entropy loss $\ell_{ce}(\hat{p}, \kappa)$ where $\kappa \in \{0, 1, 2\}$ (Incorrect, Correct, Incomplete).
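As a rough illustration of this objective, the sketch below computes the three-class cross-entropy for one partial trace and reads a scalar value off the classifier. Taking the value to be the predicted probability of the "Correct" class is an assumption made here for illustration, and the logits are made up.

```python
import math

INCORRECT, CORRECT, INCOMPLETE = 0, 1, 2  # class indices for kappa


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the observed outcome class."""
    return -math.log(softmax(logits)[label])


def predicted_value(logits):
    """Assumed value estimate: probability that the trace ends up correct."""
    return softmax(logits)[CORRECT]


logits = [0.2, 1.5, -0.7]  # made-up classifier outputs for one partial trace
print(round(cross_entropy(logits, CORRECT), 4))
print(round(predicted_value(logits), 4))
```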

Training Data:

Source: OpenR1-Math (filtered)
Roll-ins: Sampled from DeepSeek-R1-Distill-Qwen {1.5B, 7B, 14B, 32B}
Roll-outs: Generated by DeepSeek-R1-Distill-Qwen-1.5B
Labels: Correctness checked via math-verify, with a third class for incomplete traces (matching $\kappa \in \{0, 1, 2\}$ in the objective)
Volume: 2.5 million traces
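The roll-in and roll-out scheme can be sketched as follows. This is a simplified illustration under several assumptions: the cut point is chosen uniformly at random, `rollout_fn` is a hypothetical callable standing in for the roll-out model, plain string equality stands in for the math-verify check, and the function names are invented for this example.

```python
import random

INCORRECT, CORRECT, INCOMPLETE = 0, 1, 2


def label_rollout(final_answer, reference):
    """Outcome label: incomplete if no answer was produced, else right or wrong."""
    if final_answer is None:
        return INCOMPLETE
    # Assumption: string equality replaces the real answer checker.
    return CORRECT if final_answer == reference else INCORRECT


def build_examples(roll_in_tokens, reference, rollout_fn, num_cuts, rng):
    """Cut the roll-in at random points, complete each prefix, label by outcome."""
    examples = []
    for _ in range(num_cuts):
        cut = rng.randrange(1, len(roll_in_tokens) + 1)
        prefix = roll_in_tokens[:cut]
        completion, final_answer = rollout_fn(prefix)
        label = label_rollout(final_answer, reference)
        # The outcome label supervises the whole trace; no step labels are needed.
        examples.append((prefix + completion, label))
    return examples


def toy_rollout(prefix):
    """Hypothetical roll-out model: only long enough prefixes reach an answer."""
    if len(prefix) >= 3:
        return ["so", "42"], "42"
    return ["hmm"], None


rng = random.Random(0)
for trace, label in build_examples(["a", "b", "c", "d"], "42", toy_rollout, 5, rng):
    print(label, trace)
```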

Key Hyperparameters:

block_size: 4096 (for search)
beam_width: 2 (for search)
inference_budget_N: 64 (for comparison with 671B model)

Compute: Value model scoring FLOPs are negligible (<<1%) compared to generation cost.

Comparison to Prior Work

vs. Math-Shepherd/PRM800K: VGS does not require step-level definitions or labels; trains on full traces [cited in paper]
vs. DeepSeek-R1: VGS achieves similar performance with a much smaller system (15.5B vs 671B parameters) by spending a larger search budget at inference time [cited in paper]
vs. Best-of-N: VGS uses Weighted Majority Vote for aggregation, which is shown to be superior to taking the single max-score response [cited in paper]

Limitations

Value model generalization gap: The 1.5B VM is less effective at guiding the 14B model than the 7B or 1.5B models (OOD roll-ins)
Domain specific: Pipeline relies on automated outcome verification (e.g., math/code), harder to apply to open-ended creative writing
Fixed block size: Uses a constant block size (e.g., 4096 tokens) rather than dynamic step detection

Reproducibility

Code: https://github.com/kaiwenw/value-guided-search

Highly reproducible: the code, the 2.5M-trace dataset (OpenR1-VM), and the DeepSeek-VM-1.5B model are open-sourced. Filtering scripts and distributed training scripts are included.

📊 Experiments & Results

Evaluation Setup

High-school competition mathematics (AIME, HMMT)

Benchmarks:

AIME (2024 & 2025) (Competition Math)
HMMT (2024 & 2025) (Competition Math)

Metrics:

Accuracy (Pass@1 with search/voting)
Average Response Length (tokens)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Average across AIME/HMMT	Average Response Length (tokens)	12793	11219	-1574
AIME/HMMT	Qualitative Comparison	Not reported as exact number	Not reported as exact number	n/a
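The response-length row can be checked with a few lines of arithmetic. The two token counts come from the table; the variable names are only illustrative.

```python
baseline_tokens = 12793  # average response length of the baseline
vgs_tokens = 11219       # average response length with VGS

saved = baseline_tokens - vgs_tokens
relative = saved / baseline_tokens

print(saved)              # 1574
print(f"{relative:.1%}")  # 12.3%
```

The absolute difference of 1,574 tokens corresponds to the "over 12%" reduction quoted in the highlights.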

Experiment Figures

Accuracy vs. Inference FLOPs scaling curve.

Main Takeaways

VGS enables smaller models (14B) to match the performance of massive models (671B) by trading inference-time compute for parameter count
Weighted Majority Vote is a critical component; VGS outperforms Best-of-N and standard Majority Voting consistently
Token-level value models trained on outcomes can effectively guide search without needing expensive step-level annotations
Search guidance reduces the incidence of 'looping' behaviors, leading to shorter, more efficient reasoning traces

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) reasoning
Process Reward Models (PRMs) vs Outcome Reward Models (ORMs)
Beam Search and Best-of-N sampling
Reinforcement Learning (RL) concepts (Value functions)

Key Terms

VGS: Value-Guided Search—a search method where a value model scores partial generation blocks to guide the beam search process

PRM: Process Reward Model—a model that scores intermediate steps of reasoning rather than just the final outcome

ORM: Outcome Reward Model—a model that predicts the final reward (correctness) of a complete response

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

TTC: Test-Time Compute—techniques to improve model performance during inference (e.g., by sampling more answers or searching), rather than during training

WMV: Weighted Majority Vote—an aggregation method where answers are voted on, but each vote is weighted by the model's confidence/value score

BoN: Best-of-N—a sampling strategy where N solutions are generated and the one with the highest reward model score is selected

Roll-out: Generating a completion from a specific point in the reasoning trace to estimate the value of that state

Roll-in: The partial reasoning trace generated up to the current point

Block-wise search: Search strategy that generates and scores chunks of tokens (blocks) at a time, rather than sentence-by-sentence or token-by-token