rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

📝 Paper Summary

Mathematical Reasoning, Small Language Models (SLMs), Synthetic Data Generation

rStar-Math enables small language models to match OpenAI o1's math performance by iteratively generating high-quality, code-verified training data through MCTS and training a robust process preference model from scratch.

Core Problem

Small language models struggle with complex math reasoning because high-quality step-by-step training data is scarce, and existing distillation methods inherit the limitations of teacher models.

Why it matters:

Distilling from larger models (like GPT-4) has diminishing returns and cannot help models surpass the teacher's capability
Standard Chain-of-Thought generation often contains subtle logic errors in intermediate steps even if the final answer is correct, degrading training quality
Training process reward models is difficult because precise step-level human annotation is expensive and automatic scoring is inherently noisy

Concrete Example: In a complex math problem, an LLM might hallucinate a formula step but arrive at the correct answer by chance. Standard training would treat this entire trajectory as 'correct', reinforcing the hallucination. rStar-Math uses code execution to verify each step and MCTS rollouts to identify that this specific step rarely leads to success, filtering it out.

Key Novelty

Self-Evolved Deep Thinking via Code-Augmented MCTS

Code-Augmented MCTS: Generates reasoning steps interleaved with Python code; only steps with successfully executing code are retained, filtering out hallucinations before reward scoring
Process Preference Model (PPM): Replaces noisy absolute scoring of steps with a pairwise ranking approach, learning to prefer steps that lead to correct MCTS outcomes over those that do not
Self-Evolution Recipe: A 4-round iterative loop where the policy and reward models generate their own improved training data to progressively tackle harder problems without external distillation

Architecture

[Figure: the 4-round self-evolution pipeline of rStar-Math, showing the cycle of MCTS data generation, trajectory selection via Q-values, and training of the Policy SLM and PPM.]

Evaluation Highlights

Improves Qwen2.5-Math-7B from 58.8% to 90.0% on the MATH benchmark (pass@1 with 64 searches), surpassing o1-preview (85.5%)
Solves 53.3% (8/15) of problems on the Olympiad-level AIME 2024 benchmark, outperforming o1-preview (46.7%) and base Qwen2.5-Math-7B (13.3%)
Boosts smaller models significantly: Qwen2.5-Math-1.5B improves from 51.2% to 87.8% on MATH with 64 search trajectories

Breakthrough Assessment

9/10

Achieves SOTA math reasoning on small 7B models, surpassing o1-preview without distillation. The self-evolution recipe and PPM formulation offer a reproducible path for SLM scaling.

⚙️ Technical Details

Problem Definition

Setting: Complex mathematical reasoning using small language models via test-time search

Inputs: Math problem x

Outputs: Step-by-step solution trajectory ending in final answer

Pipeline Flow

Selection (Traverse tree using UCT)
Expansion (Generate next-step candidates)
Evaluation (Score candidates via PPM)
Backpropagation (Update Q-values)
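
The four phases above can be sketched as a generic MCTS loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `Node`, `propose_steps`, and `score_step` are hypothetical stand-ins for the tree, the Policy SLM, and the PPM, and the exploration constant `c = 1.4` is an arbitrary choice.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    step: str                                  # reasoning step stored at this node
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    q: float = 0.0                             # running mean of backed-up rewards


def uct(child: Node, c: float = 1.4) -> float:
    """Standard UCT score; unvisited children are explored first."""
    if child.visits == 0:
        return math.inf
    return child.q + c * math.sqrt(math.log(child.parent.visits) / child.visits)


def mcts(root: Node,
         propose_steps: Callable[[Node], List[str]],
         score_step: Callable[[Node], float],
         n_rollouts: int,
         max_depth: int) -> Node:
    for _ in range(n_rollouts):
        node, depth = root, 0
        # 1. Selection: follow the highest-UCT child down to a leaf.
        while node.children:
            node = max(node.children, key=uct)
            depth += 1
        # 2. Expansion: add candidate next steps below the leaf.
        if depth < max_depth:
            node.children = [Node(s, parent=node) for s in propose_steps(node)]
            if node.children:
                node = node.children[0]
        # 3. Evaluation: score the chosen node (hypothetical stand-in for the PPM).
        reward = score_step(node)
        # 4. Backpropagation: update visit counts and mean Q up to the root.
        while node is not None:
            node.visits += 1
            node.q += (reward - node.q) / node.visits
            node = node.parent
    return root


# Toy run: two candidate steps, only one of which is rewarded.
root = mcts(
    Node("problem"),
    propose_steps=lambda node: ["bad step", "good step"],
    score_step=lambda node: 1.0 if node.step == "good step" else 0.0,
    n_rollouts=20,
    max_depth=1,
)
most_visited = max(root.children, key=lambda n: n.visits)
print(most_visited.step, most_visited.visits)
```

In the toy run the rewarded step accumulates most of the visits, while the unrewarded step is still revisited occasionally because of the exploration term.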

System Modules

Node Selector

Select the most promising node to expand next using UCT scores derived from PPM predictions

Model or implementation: Algorithmic (UCT formula)

Policy Generator

Generate candidate next steps interleaved with Python code

Model or implementation: Policy SLM (e.g., Qwen2.5-Math-7B)

Process Preference Model (PPM)

Predict the quality (reward) of generated steps to guide the search

Model or implementation: Reward SLM (initialized from Policy)

Novel Architectural Elements

Code-augmented expansion: MCTS nodes are strictly filtered by Python execution success before being added to the tree
PPM-guided search: Uses a pairwise-trained preference model for node evaluation instead of a standard value network trained with an MSE loss
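
The execution filter can be illustrated with a small sketch. It assumes each candidate step carries a Python snippet that is run together with the code of the steps before it; the helper names `executes` and `filter_candidates` are hypothetical, and a real system would run untrusted code in a sandbox with a timeout rather than calling `exec` directly.

```python
import contextlib
import io
from typing import List


def executes(code: str) -> bool:
    """Return True if the code runs without raising an exception."""
    try:
        # Fresh namespace per trial; stdout is captured to keep the run quiet.
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, {})  # illustration only: not safe for untrusted code
        return True
    except Exception:
        return False


def filter_candidates(prefix_code: str, candidates: List[str]) -> List[str]:
    """Keep only candidate steps whose code runs when appended to the prefix."""
    return [c for c in candidates if executes(prefix_code + "\n" + c)]


prefix = "a = 3\nb = 4"
candidates = [
    "c = (a**2 + b**2) ** 0.5",   # runs
    "c = hypot(a, b)",            # NameError: hypot was never imported
    "c = a / (b - 4)",            # ZeroDivisionError
]
kept = filter_candidates(prefix, candidates)
print(kept)
```

A filter of this kind only removes steps whose code fails to run. A step that runs but encodes wrong reasoning passes through, which is why the surviving nodes are still scored by Q-values and the PPM.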

Modeling

Base Model: Qwen2.5-Math-7B (also tested on Phi3-mini-3.8B, Qwen2.5-Math-1.5B)

Training Method: Iterative Self-Evolution (4 Rounds)

Objective Functions:

Purpose: Train the Policy SLM to generate correct reasoning paths.

Formally: Standard SFT loss on high-quality trajectories selected via MCTS Q-values (Top-2 correct paths)

Purpose: Train the PPM to distinguish better steps.

Formally: Pairwise ranking loss L = -log(sigmoid(r(x, y_pos) - r(x, y_neg))), where y_pos and y_neg are steps with high/low MCTS Q-values
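
As a sketch of the PPM objective, the code below builds preference pairs from Q-value-ranked candidate steps and evaluates the pairwise ranking loss for a single pair. The function names and the choice of pairing the top `k = 2` against the bottom `k = 2` candidates are illustrative assumptions, and the rewards are plain floats rather than model outputs.

```python
import math
from typing import List, Tuple


def build_pairs(steps: List[Tuple[str, float]], k: int = 2) -> List[Tuple[str, str]]:
    """steps: (step_text, q_value) candidates that share the same prefix.

    Pairs each of the k highest-Q steps with each of the k lowest-Q steps.
    """
    ranked = sorted(steps, key=lambda s: s[1], reverse=True)
    pos, neg = ranked[:k], ranked[-k:]
    # Skip degenerate pairs where the "positive" is not strictly better.
    return [(p[0], n[0]) for p in pos for n in neg if p[1] > n[1]]


def pairwise_loss(r_pos: float, r_neg: float) -> float:
    """L = -log(sigmoid(r_pos - r_neg)), computed in a numerically stable form."""
    margin = r_pos - r_neg
    # -log(sigmoid(m)) = log(1 + exp(-m)) = max(-m, 0) + log1p(exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


steps = [("s1", 0.9), ("s2", 0.6), ("s3", -0.4), ("s4", -0.8)]
pairs = build_pairs(steps)
print(pairs)

# The loss shrinks as the preferred step is scored further above the other.
print(pairwise_loss(0.0, 0.0), pairwise_loss(3.0, 0.0), pairwise_loss(0.0, 3.0))
```

With equal rewards the loss is log 2. It approaches zero when the preferred step is scored well above the rejected one, and grows roughly linearly when the ordering is reversed.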

Training Data:

747k math problems from NuminaMath and MetaMath
Trajectories generated via MCTS rollouts (8-128 per problem, depending on the round)
Filtered by correctness and Q-value verification

Key Hyperparameters:

mcts_rollouts_round_1_2: 8-16
mcts_rollouts_round_4: Up to 128
sft_selection_criteria: Top-2 trajectories with highest average Q-values among correct solutions
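
The SFT selection criterion can be sketched as follows. The `Trajectory` container and `select_for_sft` helper are hypothetical names, and answer matching is simplified to exact string equality.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    step_q: List[float]   # Q-value of each step along the path
    answer: str           # final answer extracted from the trajectory


def avg_q(t: Trajectory) -> float:
    """Mean step Q-value; an empty trajectory scores 0.0."""
    return sum(t.step_q) / len(t.step_q) if t.step_q else 0.0


def select_for_sft(trajs: List[Trajectory], gold: str, k: int = 2) -> List[Trajectory]:
    """Keep the k correct trajectories with the highest average Q-value."""
    correct = [t for t in trajs if t.answer == gold]
    return sorted(correct, key=avg_q, reverse=True)[:k]


trajs = [
    Trajectory([0.9, 0.8], "42"),
    Trajectory([0.2, 0.4], "42"),
    Trajectory([1.0, 1.0], "41"),   # highest Q-values, but wrong answer
    Trajectory([0.7, 0.5], "42"),
]
selected = select_for_sft(trajs, gold="42")
print([t.step_q for t in selected])
```

Correctness is applied as a hard filter before ranking, so a trajectory with high Q-values but a wrong final answer is never selected.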

Compute: MCTS rollouts performed on 4x40GB A100 GPUs

Comparison to Prior Work

vs. OpenAI o1: rStar-Math achieves comparable performance using much smaller models (7B) via explicit code-augmented MCTS rather than implicit internal reasoning chains
vs. Math-Shepherd: rStar-Math trains a Process Preference Model (ranking) rather than a value function (absolute scoring) and incorporates code execution for verification
vs. Distillation methods (NuminaMath): Generates its own data from scratch (self-evolution) rather than relying on GPT-4 outputs

Limitations

Computational cost of MCTS inference is higher than standard generation (System 2 vs System 1)
Relies on the availability of ground truth final answers to verify trajectories during training data generation
Gains from the self-evolution loop level off around Round 4, attributed to data saturation and quality limits of the problem set

Reproducibility

Code: https://github.com/microsoft/rStar

Code and data are available at the repository above. The paper details the exact number of rollouts per round and the selection criteria for training data.

📊 Experiments & Results

Evaluation Setup

Math reasoning tasks evaluated using test-time search (MCTS) with the trained Policy and PPM

Benchmarks:

MATH (Competition-level math problems)
GSM8K (Grade school math word problems)
AIME 2024 (Olympiad-level math competition)
OlympiadBench (Olympiad-level math problems)

Metrics:

Accuracy (%)
Pass@1 (with varying search budgets)
Statistical methodology: Not explicitly reported in the paper
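
To make "Pass@1 with varying search budgets" concrete, the sketch below assumes that one final answer per problem is taken from the highest-scoring trajectory among the first `budget` candidates. That selection rule, the `Candidate` layout, and the toy data are assumptions made for illustration and are not taken from the summary above.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]   # (final answer, score from a reward model)


def pass_at_1(problems: List[Tuple[List[Candidate], str]], budget: int) -> float:
    """Accuracy (%) when one answer per problem is picked from `budget` candidates."""
    solved = 0
    for candidates, gold in problems:
        pool = candidates[:budget]
        # Pick the single highest-scoring candidate and compare it to the gold answer.
        if pool and max(pool, key=lambda c: c[1])[0] == gold:
            solved += 1
    return 100.0 * solved / len(problems)


problems = [
    ([("7", 0.2), ("9", 0.8)], "9"),
    ([("3", 0.5), ("4", 0.1)], "3"),
    ([("1", 0.6), ("2", 0.3), ("5", 0.9)], "5"),
]
for budget in (1, 2, 3):
    print(budget, round(pass_at_1(problems, budget), 1))
```

On this toy data accuracy rises with the budget, because a larger pool is more likely to contain a correct trajectory for the scorer to pick.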

Key Results

Benchmark	Comparison	Metric	Baseline	This Paper	Δ
MATH	Qwen2.5-Math-7B base vs. rStar-Math	Accuracy (%, Pass@1)	58.8	90.0	+31.2
MATH	Qwen2.5-Math-1.5B base vs. rStar-Math	Accuracy (%, Pass@1)	51.2	87.8	+36.6
AIME 2024	o1-preview vs. rStar-Math	Accuracy (%)	46.7	53.3	+6.6
MATH	Other reward modeling approach vs. PPM (ablation)	Accuracy (%)	84.2	86.6	+2.4

Rows 1-2: main results on the challenging MATH benchmark, showing rStar-Math's improvement over base models and comparison to proprietary SOTA.
Row 3: results on Olympiad-level AIME 2024, showing capability on very hard problems.
Row 4: ablation study demonstrating the superiority of the Process Preference Model (PPM) over other reward modeling approaches.
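
The Δ column and the AIME percentage can be checked with a few lines of arithmetic. The numbers below are copied from the table and from the Evaluation Highlights. Only the variable names are introduced here.

```python
# (benchmark, baseline %, this paper %, reported delta)
rows = [
    ("MATH", 58.8, 90.0, 31.2),
    ("MATH", 51.2, 87.8, 36.6),
    ("AIME 2024", 46.7, 53.3, 6.6),
    ("MATH", 84.2, 86.6, 2.4),
]

# Recompute each delta and round to one decimal place, as in the table.
recomputed = [round(ours - baseline, 1) for _, baseline, ours, _ in rows]
reported = [delta for *_, delta in rows]
print(recomputed == reported)

# 8 of 15 AIME problems, expressed as a percentage rounded to one decimal.
aime_pct = round(100 * 8 / 15, 1)
print(aime_pct)
```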

Experiment Figures

[Figure: illustration of the Code-Augmented CoT synthesis method within MCTS.]

Main Takeaways

Small LLMs (1.5B-7B) can achieve state-of-the-art math reasoning via deep thinking (MCTS) without needing distillation from larger models
The self-evolution process is highly effective: each round of generating data with the current best model yields better training data for the next round
Process Preference Models (PPM) trained with a pairwise ranking loss are more effective at guiding MCTS than conventional Process Reward Models (PRM) trained with a pointwise MSE loss
Code execution is a critical filter for reasoning steps, significantly improving the quality of synthetic CoT data

📚 Prerequisite Knowledge

Prerequisites

Monte Carlo Tree Search (MCTS)
Reinforcement Learning (Reward Modeling)
Chain-of-Thought (CoT) Prompting

Key Terms

MCTS: Monte Carlo Tree Search—a search algorithm that builds a decision tree by iteratively simulating future outcomes to find optimal moves

PPM: Process Preference Model—a reward model trained to rank one reasoning step over another based on likelihood of success, rather than assigning an absolute score

SLM: Small Language Model—typically models with 7 billion parameters or fewer

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

Q-value: In MCTS, the estimated value (expected future reward) of taking a specific action (reasoning step) from a given state

Rollout: A simulation in MCTS where the model generates a sequence of steps from a current node to a terminal state to estimate value

UCT: Upper Confidence Bound for Trees—a formula used in MCTS to balance exploring new paths and exploiting known good paths