MCTS: Monte Carlo Tree Search—a heuristic search algorithm that incrementally builds a search tree, expanding the most promising moves based on random sampling (rollouts) of the search space.
DPO: Direct Preference Optimization—a method to fine-tune language models to align with human preferences without explicitly training a reward model first.
MCTSG: MCTS with Global selection—a modification of MCTS proposed in this paper that selects the next node to expand from all available leaf nodes in the tree based on the value distribution, rather than only from the children of the current node.
UCB: Upper Confidence Bound—a formula used in search algorithms to balance exploration (trying less-visited paths) and exploitation (using high-reward paths).
SC: Self-Consistency—a technique where the model generates multiple reasoning paths and selects the most frequent answer as the final output.
STILL-1: Slow Thinking with LLMs—the specific implementation of the reasoning framework presented in this paper.
ORM: Outcome-based Reward Model—a model trained to predict the correctness of the final answer rather than individual steps.
PRM: Process-based Reward Model—a model trained to evaluate the correctness of intermediate reasoning steps.
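Two of the terms above (UCB and SC) can be made concrete with a minimal sketch. The function names, the exploration constant, and the example values below are illustrative assumptions, not details from the paper.

```python
import math
from collections import Counter


def ucb1(total_reward: float, visits: int, parent_visits: int,
         c: float = 1.414) -> float:
    """UCB: average reward (exploitation) plus an exploration bonus
    that grows for rarely visited nodes (exploration)."""
    if visits == 0:
        return float("inf")  # unvisited nodes are explored first
    return total_reward / visits + c * math.sqrt(
        math.log(parent_visits) / visits)


def majority_vote(answers: list[str]) -> str:
    """SC: pick the most frequent final answer among sampled
    reasoning paths."""
    return Counter(answers).most_common(1)[0][0]


# The exploration bonus can favor a less-visited child even when its
# average reward is lower (hypothetical node statistics):
child_a = ucb1(total_reward=8.0, visits=10, parent_visits=12)
child_b = ucb1(total_reward=1.0, visits=2, parent_visits=12)
print(child_a < child_b)  # the rarely visited child scores higher

print(majority_vote(["42", "41", "42"]))  # -> "42"
```

In a full MCTS implementation, the UCB score would be computed over a node's children at each selection step; MCTSG, as defined above, instead ranks all leaf nodes in the tree.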