DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

📝 Paper Summary

Neural Theorem Proving, Formal Mathematics Reasoning

DeepSeek-Prover-V2 unifies informal mathematical reasoning and formal verification by training a model to decompose complex theorems into subgoals using synthetic cold-start data and reinforcement learning.

Core Problem

LLMs excel at informal natural language reasoning but lack the syntactic rigor required for formal verification systems like Lean, often relying on heuristics that fail strict logical checks.

Why it matters:

Formal verification systems permit no ambiguity or implicit assumptions, making direct translation from informal LLM reasoning difficult
Existing formal reasoning training signals are sparse because most proof attempts fail and provide no reward
Bridging the gap between high-level human-like proof sketches and low-level formal tactics is a longstanding challenge in neural theorem proving

Concrete Example: A general-purpose LLM might correctly outline a proof for an algebra problem in English but fail to generate valid Lean code because it misses specific syntax or intermediate lemmas. DeepSeek-Prover-V2 instead prompts the model to state each intermediate step as a 'have' statement whose proof is left as a 'sorry' placeholder. These subgoals are then solved recursively, so the high-level plan translates into valid code.
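As a rough illustration of what such a sketch looks like, the Lean 4 snippet below decomposes a toy statement into two 'have' subgoals closed with 'sorry'. The theorem, its name, and the hypothesis names are invented for this example and do not come from the paper.

```lean
-- Illustrative only: a toy theorem, not taken from the paper.
theorem toy_sum (a b : Nat) (h₁ : a = 2) (h₂ : b = 3) : a + b = 5 := by
  -- Subgoal 1: substitute the value of `a` (proof left open as `sorry`)
  have step₁ : a + b = 2 + b := by sorry
  -- Subgoal 2: substitute the value of `b` and evaluate (left open as `sorry`)
  have step₂ : 2 + b = 5 := by sorry
  -- High-level plan: chain the two subgoals
  exact step₁.trans step₂
```

Lean accepts the overall structure while warning about each 'sorry', so the plan can be checked before any subgoal is actually proved.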

Key Novelty

Subgoal-based Recursive Proving with Cold-Start RL

Synthesize 'cold-start' training data by prompting a strong model (DeepSeek-V3) to decompose proofs into Lean subgoals, then recursively solving those subgoals with a smaller prover
Train a single model to perform both high-level informal reasoning (Chain-of-Thought) and low-level formal tactic generation
Apply Reinforcement Learning (GRPO) with a consistency reward that penalizes deviations from the planned subgoal structure during early training
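A minimal sketch of the subgoal-filling step, assuming the decomposition arrives as Lean text with one `have ... := by sorry` per line. The function `fill_subgoals` and the `prove` callback are hypothetical names standing in for the smaller prover; the sketch handles a single level of decomposition and does not call Lean.

```python
import re
from typing import Callable, Optional

# Matches a subgoal line such as "  have h1 : a + b = 2 + b := by sorry".
HAVE_SORRY = re.compile(r"^(\s*)have\s+(\w+)\s*:\s*(.+?)\s*:=\s*by\s+sorry\s*$")


def fill_subgoals(sketch: str, prove: Callable[[str], Optional[str]]) -> Optional[str]:
    """Replace every `have ... := by sorry` line with a proved version.

    `prove` is a hypothetical stand-in for the smaller prover: it receives the
    subgoal statement and returns a tactic string, or None if it fails.
    Returns None as soon as one subgoal cannot be solved.
    """
    completed = []
    for line in sketch.splitlines():
        match = HAVE_SORRY.match(line)
        if match is None:
            completed.append(line)  # not a subgoal line, keep unchanged
            continue
        indent, name, statement = match.groups()
        tactic = prove(statement)
        if tactic is None:
            return None  # the whole sketch is discarded
        completed.append(f"{indent}have {name} : {statement} := by {tactic}")
    return "\n".join(completed)
```

In a real pipeline the completed text would still have to be checked by Lean before being kept as training data.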

Architecture

Conceptual flow of the subgoal decomposition and recursive proving strategy.

Evaluation Highlights

88.9% Pass ratio on MiniF2F-test (Pass@8192), a state-of-the-art result for neural theorem proving
Solves 47 out of 658 problems on PutnamBench, a challenging collegiate-level mathematics benchmark
Solves 6 out of 15 highly challenging AIME 2024-2025 problems in Lean 4 formal language

Breakthrough Assessment

9/10

Establishes a new SOTA on standard benchmarks (MiniF2F) by a significant margin and demonstrates capability on competition-level problems (AIME, Putnam) previously out of reach for formal provers.

⚙️ Technical Details

Problem Definition

Setting: Formal theorem proving in Lean 4

Inputs: A formal theorem statement in Lean 4

Outputs: A complete formal proof (a sequence of tactics) that the Lean 4 checker accepts

Pipeline Flow

Input Processing (Theorem Statement)
Reasoning & Decomposition (CoT Generation)
Formal Proof Generation (Tactic Generation)

System Modules

Reasoning & Decomposition (Generation)

Analyze the problem in natural language and decompose it into formal subgoals (using 'have' statements)

Model or implementation: DeepSeek-Prover-V2-671B (Fine-tuned DeepSeek-V3)

Formal Proof Generator (Generation)

Generate precise Lean tactics to prove the subgoals and the final theorem

Model or implementation: DeepSeek-Prover-V2-671B (Same model)

Novel Architectural Elements

Unified model for both informal CoT reasoning and formal code generation, unlike previous approaches that separated these or used informal-only sketches
Curriculum learning pipeline where the model is trained on variations of problems (subgoals as premises vs. goals) to progressively increase difficulty
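One way to picture the two problem variations is the pair of Lean 4 statements below, built from a toy subgoal. The statements and names are invented for illustration, and the reading that the premise-carrying variant is the easier one is an assumption of this sketch.

```lean
-- Illustrative only: toy statements, not taken from the paper.

-- Variant with the preceding subgoal `hx` supplied as a premise:
theorem toy_sq_with_premise (x : Nat) (h : x + 1 = 3) (hx : x = 2) :
    x * x = 4 := by
  sorry

-- Variant without it: the prover has to rediscover `x = 2` itself.
theorem toy_sq_without_premise (x : Nat) (h : x + 1 = 3) :
    x * x = 4 := by
  sorry
```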

Modeling

Base Model: DeepSeek-V3-Base-671B (MoE model)

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Optimize policy to maximize expected reward (correct proof).

Formally: GRPO objective using relative rewards within a group of samples.

Purpose: Enforce alignment between generated proof structure and planned decomposition.

Formally: Consistency reward (early training) penalizing missing 'have'-structured lemmas.
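The two objectives can be sketched as follows. The group-relative advantage uses the usual GRPO standardization (reward minus group mean, divided by group standard deviation). How the consistency reward is combined with the correctness reward is not specified here, so zeroing the reward when a planned lemma is missing is an assumption of this sketch, as are all function and parameter names.

```python
import re
from statistics import mean, pstdev
from typing import List, Sequence


def proof_reward(verified: bool, proof: str, planned_lemmas: Sequence[str],
                 enforce_consistency: bool) -> float:
    """Binary correctness reward with an optional structural check.

    Assumption: a verified proof that omits a planned `have` lemma gets 0
    while the consistency check is active (early training).
    """
    if not verified:
        return 0.0
    if enforce_consistency:
        for name in planned_lemmas:
            if not re.search(rf"\bhave\s+{re.escape(name)}\b", proof):
                return 0.0
    return 1.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same theorem."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample has the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every sample in a group fails (or every sample succeeds), all advantages are zero, which is one way to see why sparse rewards give little training signal.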

Adaptation: Full fine-tuning

Trainable Parameters: 671B (total parameters)

Training Data:

Non-CoT data from expert iteration (Lean code only)
Synthetic Cold-Start CoT data: DeepSeek-V3 reasoning combined with formalized subgoals solved by a smaller 7B prover
Augmented with autoformalized problems and open-source datasets (MiniF2F, etc.)

Key Hyperparameters:

learning_rate: 5e-6 (SFT)
context_window: 16384 (SFT), 32768 (RL generation)
group_size: 32 (candidate proofs per theorem in RL)
batch_size: 256 (problems per iteration in RL)
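To make the scale of one RL iteration concrete, the sketch below collects the listed RL values in a small config object and derives the number of sampled proofs per iteration. The class and field names are hypothetical; only the three numbers come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLConfig:
    """Hypothetical container for the RL hyperparameters listed above."""
    batch_size: int = 256               # problems per iteration
    group_size: int = 32                # candidate proofs per theorem
    max_generation_tokens: int = 32768  # generation length limit

    @property
    def rollouts_per_iteration(self) -> int:
        # Every problem in the batch is sampled group_size times.
        return self.batch_size * self.group_size

    @property
    def max_tokens_per_iteration(self) -> int:
        # Upper bound only: most proofs stop well before the limit.
        return self.rollouts_per_iteration * self.max_generation_tokens


cfg = RLConfig()
print(cfg.rollouts_per_iteration)  # 256 * 32 = 8192
```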

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-Prover-V1.5: Incorporates explicit Chain-of-Thought (CoT) reasoning and subgoal decomposition into the training data and RL process
vs. DSP: Synthesizes training data by formalizing natural language proofs into structured formal sketches (subgoals) rather than just using sketches at inference time
vs. Kimina-Prover [not cited in paper]: Kimina-Prover retrosynthesizes thoughts from proofs; DeepSeek-Prover-V2 synthesizes formal sketches from forward reasoning
vs. AlphaProof: Adopts similar curriculum learning (variations of problems) but emphasizes the unified informal-formal reasoning model via GRPO

Limitations

The 671B model is computationally expensive for inference compared to smaller provers
Reliance on a smaller 7B model for initial subgoal solving might limit the complexity of cold-start data
The consistency reward is a heuristic to force structure, which might be brittle for some proof styles

Reproducibility

Code: https://github.com/deepseek-ai/DeepSeek-Prover-V

Publicly available: Code and models at https://github.com/deepseek-ai/DeepSeek-Prover-V. ProverBench dataset introduced. Missing: Exact compute hours/resources for the 671B model training.

📊 Experiments & Results

Evaluation Setup

Formal theorem proving in Lean 4 environment

Benchmarks:

MiniF2F-test (High school competition math problems formalized in Lean)
ProofNet-test (Undergraduate level math problems)
PutnamBench (Very hard Putnam competition problems)
ProverBench (New collection of formalized problems, including recent AIME 2024-2025 problems) [New]

Metrics:

Pass@1 (Accuracy with 1 attempt)
Pass@N (Accuracy with N attempts, e.g., N=32, 1024, 8192)
Statistical methodology: Not explicitly reported in the paper
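Since the estimation procedure is not reported, the following is only one common way to compute Pass@N: the unbiased estimator popularized by code-generation evaluations, which uses n sampled attempts of which c are correct. Whether the paper uses this estimator or simply counts problems solved within N attempts is not stated, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples contains at least one correct proof.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to the fraction of correct samples, c / n.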

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MiniF2F-test	Pass Ratio (%)	63.5	88.9	+25.4
MiniF2F-test	Pass@1	51.6	64.8	+13.2
ProofNet-test	Pass@1024	25.3	37.1	+11.8
PutnamBench	Problems Solved (Count)	11	47	+36
AIME 2024-2025 (Subset of ProverBench)	Problems Solved (Count)	8	6	-2

Main Takeaways

Subgoal decomposition via CoT improves formal theorem proving substantially: the MiniF2F-test pass ratio rises from the baseline's 63.5% to 88.9%, a new SOTA.
The gap between informal mathematical reasoning (LLMs) and formal proving is narrowing: the formal prover solves 6 recent AIME problems, against 8 for the base LLM.
Reinforcement learning (GRPO) starting from synthetic cold-start data effectively unifies informal reasoning and formal tactic generation.
The model generalizes well to much harder, unseen problems (PutnamBench) compared to previous versions.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of interactive theorem provers (specifically Lean 4)
Familiarity with Reinforcement Learning (RL) concepts like Policy Optimization
Understanding of Large Language Models (LLMs) and Chain-of-Thought (CoT) reasoning

Key Terms

Lean 4: A functional programming language and interactive theorem prover used for formalizing mathematics

CoT: Chain-of-Thought, a reasoning technique where the model generates intermediate steps before the final answer

GRPO: Group Relative Policy Optimization, an RL algorithm that optimizes policies based on relative rewards of a group of samples, eliminating the need for a critic model

subgoal: An intermediate lemma or proposition that serves as a stepping stone to proving the main theorem

tactic: A command or instruction in a theorem prover that advances the proof state

expert iteration: A training method where a model generates new data (proofs) which are filtered for correctness and used to retrain/improve the model

Pass@K: A metric measuring the probability that at least one correct solution is found in K generated attempts

cold start: Initial training phase using high-quality synthetic data to establish basic capabilities before reinforcement learning

RL: Reinforcement Learning, training models by rewarding desired behaviors (correct proofs) and penalizing incorrect ones