ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search

📝 Paper Summary

LLM Reasoning, Reinforcement Learning from AI Feedback (RLAIF)

ReST-MCTS∗ is a self-training framework that uses tree search and estimated process rewards to automatically generate high-quality reasoning traces and per-step value labels for refining LLM reasoning.

Core Problem

Existing self-training methods rely on final-answer correctness, often keeping traces with wrong reasoning but correct answers (false positives), while training per-step process verifiers typically requires expensive human annotation.

Why it matters:

False positive reasoning traces (correct answer, wrong logic) degrade model performance on complex tasks
Manual annotation for Process Reward Models (PRMs) is unscalable, limiting the ability to verify intermediate reasoning steps
Sparse rewards (only at the end) make credit assignment difficult for long reasoning chains

Concrete Example: An LLM might solve a math problem by making two calculation errors that cancel each other out, arriving at the correct final number. Standard self-training (like STaR) treats this trace as 'correct' training data and so teaches the model faulty arithmetic. ReST-MCTS∗ estimates that the intermediate steps have a low probability of leading to a correct answer and filters the trace out.
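A toy version of this false-positive case, as a sketch only: the step format and the `check_steps` helper are hypothetical, and the exact arithmetic check stands in for the estimated process rewards that ReST-MCTS∗ actually uses.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

# Toy trace for "3 * 4 + 5": two arithmetic slips cancel out.
# Each step is (left operand, operator, right operand, claimed result).
trace = [
    (3, "*", 4, 13),   # wrong: 3 * 4 is 12
    (13, "+", 5, 17),  # wrong: 13 + 5 is 18, yet 17 is the true final answer
]
gold_answer = 3 * 4 + 5


def outcome_correct(steps, gold):
    """Outcome-only check: looks at the final claimed value and nothing else."""
    return steps[-1][3] == gold


def check_steps(steps):
    """Per-step check: one boolean per step, True when the claimed result is right."""
    return [OPS[op](a, b) == claimed for a, op, b, claimed in steps]


keep_by_outcome = outcome_correct(trace, gold_answer)
keep_by_process = keep_by_outcome and all(check_steps(trace))
```

An outcome-only filter keeps this trace, while a filter that also looks at the individual steps rejects it.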

Key Novelty

Auto-labeled Process Rewards via Tree Search Statistics

Uses MCTS∗ (a modified Monte Carlo Tree Search) to explore many reasoning paths; the probability of a partial step leading to a correct answer becomes its 'process reward' label
Circumvents manual labeling by using these search-derived statistics to train a Process Reward Model (PRM) and a Policy Model in a loop
Introduces 'reasoning distance' (estimated steps to solution) to weight rewards, prioritizing steps that make progress toward the solution
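The first bullet can be sketched as a Monte Carlo estimate: sample several completions from a partial solution and take the fraction that end in a correct answer. The names `estimate_step_value`, `rollout`, and `is_correct` and the toy counting task are assumptions for illustration; MCTS∗ derives its values from the search tree rather than from this flat sampling.

```python
import random


def estimate_step_value(prefix, rollout, is_correct, n_rollouts=200, seed=0):
    """Fraction of sampled completions of `prefix` that reach a correct answer."""
    rng = random.Random(seed)
    hits = sum(is_correct(rollout(prefix, rng)) for _ in range(n_rollouts))
    return hits / n_rollouts


# Toy task (hypothetical): reach exactly 10 by adding 1 or 2,
# using at most 4 further steps after the given prefix.
def rollout(prefix, rng):
    total = sum(prefix)
    for _ in range(4):
        if total >= 10:
            break
        total += rng.choice([1, 2])
    return total


def is_correct(total):
    return total == 10


good_prefix = [2, 2, 2, 2]  # total 8: 10 is one or two steps away
bad_prefix = [1]            # total 1: at most 9 is reachable in 4 steps

v_good = estimate_step_value(good_prefix, rollout, is_correct)
v_bad = estimate_step_value(bad_prefix, rollout, is_correct)
```

Here `v_bad` is exactly zero because no completion can succeed, while `v_good` is a noisy estimate of the true success probability; values like these are what would serve as per-step labels.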

Architecture

The iterative cycle of ReST-MCTS∗. It shows the interaction between the Policy Model, the Process Reward Model, and the MCTS∗ search process.

Evaluation Highlights

Outperforms Self-Rewarding LM by 6.2 accuracy points (39.8 vs. 33.6) on the difficult MATH benchmark
MCTS∗ search policy achieves 91.2% accuracy on GSM8K, outperforming Tree-of-Thought (85.2%) and Best-of-N (87.8%) given the same budget
Learned Process Reward Model achieves 72.8% accuracy in selecting correct reasoning steps, surpassing Math-Shepherd (66.8%)

Breakthrough Assessment

8/10

Automates PRM training without human labels, addressing a major bottleneck in reasoning. Strong empirical gains on hard benchmarks like MATH and SciBench.

⚙️ Technical Details

Problem Definition

Setting: Multi-step reasoning where a policy generates a sequence of steps to solve a problem Q

Inputs: Natural language question Q

Outputs: Reasoning trace (s1, s2, ..., sK) and final answer

Pipeline Flow

Iterative Loop: MCTS∗ Search (Data Collection) → Filter Traces → Train PRM & Policy
Generation: Policy Model + PRM → MCTS∗ Search Tree
Annotation: Tree Statistics → Process Reward Labels
Training: Labeled Traces → Update Policy & PRM
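The flow above can be read as a loop over its stages. The skeleton below is a structural sketch only: every function is a hypothetical stub, the dictionaries stand in for real models, and none of the names come from the released code.

```python
def run_search(policy, prm, questions):
    """Stub for MCTS* data collection: one search tree per question."""
    return [{"question": q, "built_with": (policy["version"], prm["version"])}
            for q in questions]


def label_from_tree_stats(trees):
    """Stub for annotation: turn tree statistics into per-step value labels."""
    return [{"question": t["question"], "steps": [], "values": []} for t in trees]


def train(model, labeled_traces):
    """Stub for fine-tuning: returns a new model record with a bumped version."""
    return {"name": model["name"], "version": model["version"] + 1}


def self_training_loop(policy, prm, questions, num_iterations=3):
    for _ in range(num_iterations):
        trees = run_search(policy, prm, questions)   # Generation
        labeled = label_from_tree_stats(trees)       # Annotation (and filtering)
        policy = train(policy, labeled)              # Training: policy
        prm = train(prm, labeled)                    # Training: PRM
    return policy, prm


policy, prm = self_training_loop(
    {"name": "policy", "version": 0},
    {"name": "prm", "version": 0},
    questions=["Q1", "Q2"],
)
```

The point of the structure is that each round's search is run with the models produced by the previous round, so better models yield better labels.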

System Modules

Policy Model

Generates next-step candidates for the reasoning tree

Model or implementation: LLM (e.g., Llama-2-7B or Mistral-7B)

Process Reward Model (PRM)

Predicts the value/quality of a partial solution to guide search

Model or implementation: LLM with scalar head

MCTS∗ Search

Explores the reasoning space using the Policy and PRM to find correct solutions

Model or implementation: Algorithm (Selection, Expansion, Rollout, Backpropagation)
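For readers unfamiliar with the four phases, here is a minimal generic MCTS with UCT selection on a toy task. It is not MCTS∗: the toy task (reach a sum of 10 in four steps of 1, 2, or 3), the random rollouts, and all names are assumptions for illustration, whereas the paper's search draws candidate steps from the policy LLM and is guided by the PRM.

```python
import math
import random

ACTIONS = (1, 2, 3)
TARGET, MAX_STEPS = 10, 4


class Node:
    def __init__(self, state, parent=None):
        self.state = state        # tuple of steps taken so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def is_terminal(state):
    return len(state) == MAX_STEPS or sum(state) >= TARGET


def reward(state):
    return 1.0 if sum(state) == TARGET else 0.0


def uct(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")       # try unvisited children first
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return child.mean_value() + explore


def select(node):
    """Selection: descend by UCT until reaching a node with no children."""
    while node.children:
        parent_visits = node.visits
        node = max(node.children, key=lambda ch: uct(ch, parent_visits))
    return node


def expand(node):
    """Expansion: add one child per available action."""
    node.children = [Node(node.state + (a,), parent=node) for a in ACTIONS]
    return node.children


def rollout(state, rng):
    """Rollout: finish the trace with random steps and score the outcome."""
    while not is_terminal(state):
        state = state + (rng.choice(ACTIONS),)
    return reward(state)


def backpropagate(node, value):
    """Backpropagation: update visit counts and value sums up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent


def search(n_iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node(state=())
    for _ in range(n_iterations):
        leaf = select(root)
        if not is_terminal(leaf.state):
            leaf = rng.choice(expand(leaf))
        backpropagate(leaf, rollout(leaf.state, rng))
    return root


root = search()
most_visited_first_step = max(root.children, key=lambda ch: ch.visits).state[0]
```

After the search, each node's `mean_value()` is an empirical success rate for the partial trace it represents, which is the kind of tree statistic the framework turns into process reward labels.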

Novel Architectural Elements

Dual-purpose inferred rewards: Search statistics (estimated probability of reaching a correct answer) serve both as value targets for training the PRM and as a signal for selecting high-quality traces for policy self-training
Integration of 'reasoning distance' (steps to solution) into the value function definition to prioritize efficient paths
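To make the role of reasoning distance concrete, the sketch below uses one simple rule in which each correct step closes a share of the remaining gap to 1, and the share grows as fewer steps remain. This rule, `update_value`, and `steps_remaining` are assumptions chosen for illustration, not the paper's exact definition; consult the paper for the actual formula.

```python
def update_value(v_prev, steps_remaining):
    """Move the value toward 1 by a share of the remaining gap.

    Illustrative rule (an assumption, not quoted from the paper):
    the fewer steps remain, the larger the share credited to this step.
    """
    return v_prev + (1.0 - v_prev) / (steps_remaining + 1)


# A correct 4-step trace: after each step there are 3, 2, 1, 0 steps left.
values = []
v = 0.0
for remaining in (3, 2, 1, 0):
    v = update_value(v, remaining)
    values.append(v)
```

Under this rule the value stays within [0, 1] and reaches 1 exactly when no steps remain, so a step that brings the solution closer earns more than one that leaves the distance unchanged.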

Modeling

Base Model: Mistral-7B-Instruct-v0.2 and Llama-2-7B/13B

Training Method: MuZero-style iterative self-training

Objective Functions:

Purpose: Train the Policy Model to generate correct reasoning steps.
Formally: Standard language-modeling (cross-entropy) loss on filtered high-quality traces.

Purpose: Train the Process Reward Model to predict the probability of a step leading to a correct answer.
Formally: Mean Squared Error (MSE) between the predicted value v_theta and the target value derived from MCTS statistics.
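Written out, the two objectives might take the following form. The notation is an assumption for illustration: \pi_\phi for the policy, D for the filtered trace set, N for the number of labeled partial solutions, and \hat{v}_i for the MCTS-derived target of the i-th partial solution; only v_theta appears in the summary itself.

```latex
\mathcal{L}_{\text{policy}}(\phi)
  = -\sum_{(Q,\, s_{1:K}) \in D} \; \sum_{k=1}^{K}
    \log \pi_\phi\!\left(s_k \mid Q,\, s_{1:k-1}\right)

\mathcal{L}_{\text{PRM}}(\theta)
  = \frac{1}{N} \sum_{i=1}^{N}
    \left( v_\theta\!\left(Q_i,\, s^{(i)}_{1:k_i}\right) - \hat{v}_i \right)^{2}
```

The first sum runs over reasoning steps of kept traces, so only traces that survive filtering shape the policy; the second regresses the PRM output onto the search-derived value of each partial solution.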

Adaptation: Fine-tuning

Training Data:

Initial Policy/PRM warm-up using base datasets
Iterative data generation: MCTS∗ runs on training questions to produce trees
Filtering: Traces leading to correct final answers are kept; step values calculated via rollout success rates
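A small sketch of the filtering step, under stated assumptions: the trace dictionary layout, the `(wins, total)` rollout counts, and the optional `min_step_value` threshold are all hypothetical and are not taken from the paper or its code.

```python
def step_values(rollout_counts):
    """Per-step value = share of rollouts through that step that ended correctly."""
    return [wins / total for wins, total in rollout_counts]


def filter_traces(traces, gold, min_step_value=0.0):
    """Keep traces with a correct final answer; optionally require every
    step value to reach `min_step_value` (a hypothetical extra threshold)."""
    kept = []
    for trace in traces:
        if trace["answer"] != gold[trace["qid"]]:
            continue
        values = step_values(trace["rollouts"])
        if min(values) < min_step_value:
            continue
        kept.append({**trace, "step_values": values})
    return kept


# Each rollout entry is (correct rollouts, total rollouts) for one step.
traces = [
    {"qid": "q1", "answer": "17", "rollouts": [(9, 10), (8, 10)]},
    {"qid": "q1", "answer": "18", "rollouts": [(9, 10), (1, 10)]},
    {"qid": "q2", "answer": "5", "rollouts": [(2, 10), (7, 10)]},
]
gold = {"q1": "17", "q2": "5"}

kept_by_answer = filter_traces(traces, gold)
kept_strict = filter_traces(traces, gold, min_step_value=0.5)
```

With the default threshold only the wrong-answer trace is dropped; raising the threshold also drops the trace that reached the right answer through a low-value step.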

Key Hyperparameters:

num_iterations: 3
search_budget: Same budget constraint enforced for baselines (e.g., number of node expansions)
discount_factor_lambda: Not explicitly reported in the paper (implied by value updates)

Compute: Not reported in the paper

Comparison to Prior Work

vs. ReST-EM: ReST-MCTS∗ trains a PRM and Policy jointly, whereas ReST-EM only trains the Policy
vs. Math-Shepherd: ReST-MCTS∗ iteratively improves the policy using the search traces, whereas Math-Shepherd focuses on training the verifier
vs. ToT: ReST-MCTS∗ uses a trained PRM for value guidance instead of prompting the LLM for evaluation
vs. Self-Rewarding LM: ReST-MCTS∗ estimates step values from tree-search statistics checked against ground-truth answers, rather than relying on the LLM's subjective self-evaluation

Limitations

Depends on the availability of ground-truth final answers to verify traces (requires labeled datasets like MATH/GSM8K)
Computational cost of MCTS∗ during training data generation is high compared to simple sampling
Does not explicitly report wall-clock training time or GPU resources required

Reproducibility

Code: https://github.com/THUDM/ReST-MCTS

The paper details the algorithms for value calculation and search. Specific training hyperparameters (learning rate, batch size) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Mathematical and scientific reasoning tasks

Benchmarks:

GSM8K (Grade School Math)
MATH (Challenging Math Problems)
SciBench (Scientific Reasoning: Physics, Chemistry, Math)

Metrics:

Accuracy (Pass@1)
RM Selection Accuracy (Step-level correctness)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Policy Performance: ReST-MCTS∗ outperforms other self-training methods on the challenging MATH benchmark after 3 iterations.
MATH	Accuracy	33.6 (Self-Rewarding LM)	39.8	+6.2
MATH	Accuracy	24.6	39.8	+15.2
Search Efficiency: MCTS∗ finds correct answers more effectively than other inference-time search strategies given the same compute budget.
GSM8K	Accuracy	85.2 (Tree-of-Thought)	91.2	+6.0
SciBench	Accuracy	57.8	65.4	+7.6
Reward Model Quality: The inferred process rewards create a more accurate verifier than previous methods.
MATH (Subset)	Step Selection Accuracy	66.8 (Math-Shepherd)	72.8	+6.0
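As a quick consistency check on the table, the snippet below recomputes each Δ from the two accuracy columns. The numbers are copied from the rows above; the variable names are incidental.

```python
# (benchmark, baseline accuracy, this paper's accuracy, reported delta)
rows = [
    ("MATH", 33.6, 39.8, 6.2),
    ("MATH", 24.6, 39.8, 15.2),
    ("GSM8K", 85.2, 91.2, 6.0),
    ("SciBench", 57.8, 65.4, 7.6),
    ("MATH (Subset)", 66.8, 72.8, 6.0),
]

# Round to one decimal place to avoid floating-point noise in the subtraction.
computed = [round(ours - base, 1) for _, base, ours, _ in rows]
mismatches = [
    (name, reported, got)
    for (name, _, _, reported), got in zip(rows, computed)
    if reported != got
]
```

An empty `mismatches` list means every reported Δ equals the difference of its row, in accuracy points.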

Experiment Figures

Comparison of search accuracy vs. search budget (number of responses/nodes) for several methods (MCTS∗, Best-of-N, Self-Consistency) on SciBench and MATH.

Main Takeaways

ReST-MCTS∗ effectively scales up supervision: MCTS-derived labels are higher quality than simple outcome-based labels or LLM self-evaluation.
Iterative improvement works: Both the Policy and the PRM improve over 3 rounds of mutual self-training.
Guidance matters: The learned PRM allows the search algorithm (MCTS∗) to prune bad branches early, so it reaches higher accuracy than Best-of-N or ToT at the same budget (91.2% vs. 87.8% and 85.2% on GSM8K).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (Policy, Value Function)
Monte Carlo Tree Search (MCTS)
Large Language Models (LLMs) for reasoning (Chain-of-Thought)

Key Terms

PRM: Process Reward Model—a model that evaluates the correctness of each intermediate step in a reasoning chain, rather than just the final answer

MCTS: Monte Carlo Tree Search—a search algorithm that builds a decision tree by simulating future outcomes to find optimal moves

MCTS∗: The paper's modified search algorithm that uses a learned value function for guidance and updates values based on search rollouts

rollout: Simulating a reasoning path from a current state to a final outcome to estimate the value of that state

reasoning distance: The estimated number of steps remaining from the current state to reach a correct solution; used to weight rewards

BoN: Best-of-N—a strategy where the model generates N solutions and a verifier selects the best one

CoT: Chain-of-Thought—prompting the model to generate intermediate reasoning steps before the final answer

ToT: Tree-of-Thought—a prompting strategy that explores multiple reasoning paths in a tree structure

STaR: Self-Taught Reasoner—a self-training method where a model learns from its own correct solutions

SciBench: A benchmark dataset consisting of complex scientific reasoning problems

MATH: A dataset of challenging mathematics problems requiring multi-step reasoning

GSM8K: A dataset of grade school math word problems