DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

📝 Paper Summary

Formal Theorem Proving, Mathematical Reasoning

DeepSeek-Prover-V1.5 integrates reinforcement learning from compiler feedback and a curiosity-driven tree search to generate correct formal proofs by iteratively truncating and resuming from valid intermediate states.

Core Problem

Whole-proof generation models often hallucinate intermediate states in long proofs, while step-by-step methods are computationally expensive and struggle with sparse rewards.

Why it matters:

Formal theorem proving requires rigorous correctness; a single invalid step invalidates the entire proof.
Existing models like GPT-4 struggle to align abstract reasoning with precise formal syntax (e.g., Lean 4).
Prior whole-proof generators suffer from compounding errors because they lack access to the actual intermediate states returned by the verifier.

Concrete Example: In a long proof, the model might assume that a tactic such as `rewrite [h]` succeeds and keep generating. If the tactic actually fails or produces an unexpected state, every subsequent line of generated code is invalid.

Key Novelty

Truncate-and-Resume Mechanism within Monte-Carlo Tree Search (MCTS)

The system generates a proof segment, verifies it, and if an error occurs, truncates the code at the error. It then resumes generation using the verifier's actual state feedback.
RMaxTS (R-Max Tree Search): A tree search algorithm that uses intrinsic rewards (curiosity) to explore diverse proof paths even when external rewards (success/fail) are sparse.

Architecture

The truncate-and-resume inference loop coupled with tree search.

Evaluation Highlights

Achieved 63.5% pass rate on the miniF2F-test benchmark, establishing a new state-of-the-art for open-source models.
Achieved 25.3% pass rate on the undergraduate-level ProofNet benchmark (test set).
RL training improved performance from 55.7% (SFT) to 60.2% on miniF2F even without tree search, demonstrating the effectiveness of proof assistant feedback.

Breakthrough Assessment

9/10

Significant jump in SOTA performance on standard benchmarks (miniF2F, ProofNet) by effectively combining RL from compiler feedback with a novel tree search integration.

⚙️ Technical Details

Problem Definition

Setting: Formal theorem proving in Lean 4

Inputs: Theorem statement (formal specification) + optional natural language comments

Outputs: Complete, verifiable proof code in Lean 4

Pipeline Flow

Theorem Input → MCTS Selection → Expansion (Generate proof segment)
Verification (Lean 4) → Truncate at Error → Update Tree
Resume Generation with Tactic State → Repeat until solved or budget exhausted
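
The loop above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the paper's implementation: `toy_verify` is a hypothetical stand-in for the Lean 4 verifier (it only checks tactics against a fixed set), `toy_generator` is a hypothetical scripted policy, and the tree search is omitted so that only the truncate-and-resume step is visible. In the real system the resumed prompt also carries the verifier's tactic state; here the verified prefix plays that role.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for the Lean 4 verifier: a tactic is "valid" if it
# appears in this fixed set, and the proof is complete once "done" is reached.
VALID_TACTICS = {"intro h", "rw [h]", "simp", "done"}


def toy_verify(tactics: List[str]) -> Tuple[Optional[int], bool]:
    """Return (index of the first failing tactic or None, proof complete?)."""
    for i, tactic in enumerate(tactics):
        if tactic not in VALID_TACTICS:
            return i, False
        if tactic == "done":
            return None, True
    return None, False


def truncate_and_resume(
    generate: Callable[[List[str]], List[str]],
    max_rounds: int = 8,
) -> Optional[List[str]]:
    """Extend a verified prefix until the proof checks or the budget runs out."""
    prefix: List[str] = []
    for _ in range(max_rounds):
        candidate = prefix + generate(prefix)
        error_index, complete = toy_verify(candidate)
        if complete:
            # Drop anything generated after the closing tactic.
            return candidate[: candidate.index("done") + 1]
        # Truncate at the first error; keep everything if nothing failed yet.
        prefix = candidate if error_index is None else candidate[:error_index]
    return None


def toy_generator(prefix: List[str]) -> List[str]:
    """Hypothetical scripted policy: the first attempt contains a bad tactic."""
    script = {
        0: ["intro h", "rewrite_all"],    # second tactic is invalid
        1: ["rw [h]", "simp", "done"],    # resumed from the valid prefix
    }
    return script.get(len(prefix), ["done"])


print(truncate_and_resume(toy_generator))
# ['intro h', 'rw [h]', 'simp', 'done']
```

The first attempt is cut after `intro h`, and the second attempt continues from that verified prefix instead of restarting the whole proof.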

System Modules

Policy Model

Generates proof code segments (tactics) given the current proof state and history

Model or implementation: DeepSeek-Prover-V1.5-RL (7B parameter model based on DeepSeekMath-Base)

Formal Verifier

Compiles the generated code to check correctness and return new states

Model or implementation: Lean 4 Compiler / REPL

Tree Search Agent (RMaxTS)

Orchestrates the search by selecting which nodes (proof states) to expand next based on exploration rewards

Model or implementation: RMaxTS Algorithm
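
To make the exploration idea concrete, the sketch below pairs a novelty-based intrinsic reward with a plain UCB1 selection rule. Treat it as an illustration only: the `Node` class, the function names, and the exploration constant are hypothetical, and UCB1 is swapped in here as a simple stand-in, so the paper's exact selection rule and reward bookkeeping may differ.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Node:
    """One proof state in the search tree (hypothetical structure)."""
    visits: int = 0
    total_reward: float = 0.0
    children: Dict[str, "Node"] = field(default_factory=dict)


def intrinsic_reward(seen_states: Set[str], new_states: Iterable[str]) -> float:
    """Return 1.0 if the expansion reached any previously unseen state, else 0.0."""
    fresh = [s for s in new_states if s not in seen_states]
    seen_states.update(fresh)
    return 1.0 if fresh else 0.0


def ucb_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Plain UCB1 score; unvisited children get infinite priority."""
    if child.visits == 0:
        return math.inf
    mean_reward = child.total_reward / child.visits
    bonus = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return mean_reward + bonus


def select_child(parent: Node) -> str:
    """Pick the key of the child with the highest UCB score."""
    return max(
        parent.children,
        key=lambda k: ucb_score(parent.children[k], parent.visits),
    )
```

Because the reward depends on novelty rather than on proof success, the search still receives a learning signal on theorems where no sampled proof has succeeded yet.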

Novel Architectural Elements

Truncate-and-resume mechanism acting as the state transition function in MCTS: mapping a node (proof state) and action (generated code) to a new node (next valid state or error)
Integration of natural language CoT comments directly interleaved with formal tactics in the prompt/generation loop

Modeling

Base Model: DeepSeek-Prover-V1.5-Base (initialized from DeepSeekMath-Base 7B)

Training Method: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: SFT Loss.

Formally: Standard cross-entropy loss on proof completion tokens.

Purpose: RL Reward.

Formally: Binary reward (1 if proof checks, 0 otherwise) used in GRPO to update policy relative to group mean.
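
As a rough illustration of how a binary reward becomes a group-relative learning signal, the sketch below centers each reward on its group mean and scales by the group standard deviation. The scaling step follows the usual description of GRPO and is an assumption here, since this summary only mentions the group mean; the function name and `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center rewards on the group mean and scale by the group std deviation.

    `rewards` holds one binary reward per sampled proof for the same theorem
    (1.0 if the proof checks, 0.0 otherwise).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One of four sampled proofs verified: it gets a positive advantage,
# the three failures get a negative one.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

When every sample in a group succeeds or every sample fails, all advantages are zero, so such a theorem contributes no gradient under this scheme.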

Adaptation: Full fine-tuning

Training Data:

Augmented DeepSeek-Prover-V1 dataset (Mathlib4, synthetic theorems)
Expert iteration: Model generates proofs, correct ones are added back to training
DeepSeek-Coder V2 236B used to annotate proofs with natural language CoT comments
Intermediate tactic states inserted into training data as comments
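
For intuition, a training example with a natural-language comment and an inserted tactic state might look roughly like the Lean 4 snippet below. The theorem, its name, and the comment layout are illustrative assumptions, not a sample from the paper's dataset.

```lean
-- Illustrative only: the comment layout is an assumption, not the paper's exact format.
theorem add_self_congr_example (a b : Nat) (h : a = b) : a + a = b + b := by
  -- Replace a with b using h; both sides of the goal then coincide.
  /- tactic state:
    a b : Nat
    h : a = b
    ⊢ a + a = b + b
  -/
  rw [h]
```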

Key Hyperparameters:

learning_rate_sft: 1e-4
learning_rate_rl: 5e-6
batch_size_sft: 2048
batch_size_rl: 512
kl_penalty_coefficient: 0.02
group_size_grpo: 32
max_context_length: 4096 tokens

Compute: Single A100-40G GPU used for evaluation/inference

Comparison to Prior Work

vs. DeepSeek-Prover-V1: Adds intermediate state feedback via truncate-and-resume and RLPAF [cited in paper]
vs. Lean-STaR: Integrates CoT comments directly into the proof code rather than as separate reasoning blocks [cited in paper]
vs. Copra: Uses whole-proof segments with truncation rather than strict single-step generation, improving efficiency [not cited in paper]
vs. HyperTree Proof Search (HTPS): HTPS trains a value function for MCTS; DeepSeek-Prover-V1.5 uses intrinsic curiosity (RMax) without a learned value function [not cited in paper]

Limitations

Dependency on the Lean 4 verifier's speed (inference latency is bounded by compilation time).
Reward signal is binary and sparse; partial credit for 'almost correct' proofs is not utilized.
The truncate-and-resume mechanism requires valid syntax to parse errors; syntax errors might disrupt the loop.
Requires high-quality synthetic data for SFT to bootstrap the RL process.

Reproducibility

Code: https://github.com/deepseek-ai/DeepSeek-Prover-V1.5

Models (Base, SFT, RL) and MCTS code are publicly available at https://github.com/deepseek-ai/DeepSeek-Prover-V1.5. Training data includes synthetic proofs and augmentations. Specific compute hours for training are not explicitly reported.

📊 Experiments & Results

Evaluation Setup

Formal theorem proving in Lean 4

Benchmarks:

miniF2F (High-school level competition math problems)
ProofNet (Undergraduate level math problems)

Metrics:

Pass@1
Pass@K (K up to 3200 for tree search)
Statistical methodology: Not explicitly reported in the paper
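
For readers unfamiliar with Pass@K, one common way to estimate it from n samples with c successes is the combinatorial estimator popularized in code-generation evaluation. The sketch below shows that estimator as an assumption: theorem-proving papers often simply report the fraction of problems solved within a budget of K samples, and this summary does not say which convention the paper follows.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a problem
    c: number of those samples that were correct
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is then the average of this value over all problems.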

Key Results

Performance improvements on miniF2F-test benchmark across model iterations and search strategies.

Benchmark	Metric	Baseline	This Paper	Δ
miniF2F-test	Pass Rate (%)	50.0	60.2	+10.2
miniF2F-test	Pass Rate (%)	50.0	63.5	+13.5
miniF2F-test	Pass Rate (%)	55.7	60.2	+4.5

Performance improvements on ProofNet-test benchmark showing generalization to undergraduate math.

Benchmark	Metric	Baseline	This Paper	Δ
ProofNet-test	Pass Rate (%)	13.8	25.3	+11.5

Experiment Figures

Pass@K curves for Base, SFT, and RL models on miniF2F and ProofNet as sample budget K increases.

Main Takeaways

Reinforcement learning from proof assistant feedback (RLPAF) improves on SFT alone by 4.5 percentage points on miniF2F (55.7% to 60.2%), without any tree search.
The truncate-and-resume mechanism allows the model to recover from errors and effectively utilize intermediate tactic states, bridging the gap between whole-proof and step-wise generation.
RMaxTS (curiosity-driven tree search) further enhances performance by effectively exploring the sparse reward landscape of theorem proving.
Chain-of-thought (CoT) prompting, integrated as comments, consistently outperforms non-CoT generation in formal proof synthesis.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) concepts (policy, reward, exploration)
Formal Verification (Lean 4)
Monte-Carlo Tree Search (MCTS)

Key Terms

RLPAF: Reinforcement Learning from Proof Assistant Feedback—using the binary success/failure signal from a formal verifier as a reward for RL

MCTS: Monte-Carlo Tree Search—a heuristic search algorithm for decision processes that builds a search tree by sampling random outcomes

RMaxTS: A variant of MCTS proposed in this paper that uses the R-Max principle (optimism in the face of uncertainty) to encourage exploration of unvisited states

Lean 4: A functional programming language and interactive theorem prover used for formalizing mathematics

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes a policy based on the relative performance of a group of outputs for the same input, eliminating the need for a critic model

tactic state: The current logical context in a proof (hypotheses and goals) returned by the proof assistant after applying a tactic

truncate-and-resume: A mechanism where invalid proof generation is cut off at the first error, and generation restarts from that point using the correct compiler state

CoT: Chain-of-Thought—a prompting strategy where the model generates natural language reasoning steps before producing the formal code

pass@K: A metric measuring the probability that at least one correct solution is generated within K attempts

intrinsic reward: An artificial reward signal generated by the agent itself (e.g., for visiting new states) to motivate exploration when external rewards are sparse