
Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yong-Xu Wu, Yuchen Wu, Yihang Xia, Hua Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, et al.
ByteDance
arXiv.org (2025)
Reasoning · RL · Benchmark

πŸ“ Paper Summary

Automated Theorem Proving (ATP) · Formal Mathematics (Lean 4) · Neuro-symbolic Geometry Solving
Seed-Prover combines a lemma-style whole-proof LLM with iterative refinement and a specialized geometry engine to solve IMO-level math problems in Lean 4.
Core Problem
LLMs struggle with formal theorem proving: natural-language proofs are difficult to verify automatically, and formal proofs offer little intermediate supervision beyond a binary compile-or-fail signal.
Why it matters:
  • Natural language proofs are hard to verify automatically, hindering effective reinforcement learning
  • Step-level provers (generating line-by-line) often lack high-level reasoning capabilities needed for complex proofs
  • Existing whole-proof models fail to make effective use of compiler feedback or to explore the problem broadly by conjecturing useful intermediate properties
Concrete Example: For a challenging functional equation problem, a standard prover might try to solve it directly and fail. Seed-Prover proposes diverse conjectures (e.g., 'f is injective') and proves them as lemmas first, building a library of facts to solve the main problem.
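To make the decomposition concrete, here is a minimal Lean 4 sketch of the lemma-style pattern on a toy statement (not the paper's functional equation): an auxiliary fact is proved and compiler-verified on its own, then reused by the main theorem. The names shift_injective and cancel_one are illustrative only.

```lean
import Mathlib

-- Auxiliary conjecture, proved and compiler-verified independently,
-- then available as a reusable fact in the lemma pool.
lemma shift_injective : Function.Injective (fun n : ℕ => n + 1) := by
  intro a b h
  simpa using h

-- The main statement consumes the verified lemma instead of
-- re-deriving the fact inline.
theorem cancel_one (a b : ℕ) (h : a + 1 = b + 1) : a = b :=
  shift_injective h
```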
Key Novelty
Lemma-Style Proving & Test-Time Scaling Strategies
  • Prioritizes generating independent, reusable lemmas before the main theorem, enabling modular verification and progress tracking that monolithic whole-proof generation lacks
  • A 'Proposer' module generates many conjectures about the problem (broad reasoning); those that can be proved populate a lemma pool that feeds the main proof (see the sketch after this list)
  • Implements a specialized 'Seed-Geometry' engine that combines a neural policy for auxiliary constructions with a fast C++ symbolic forward-chaining engine
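Below is a minimal Python sketch of how such a broad/deep loop could be organized, assuming access to an LLM prover and a Lean 4 compiler check. Every function here (propose_conjectures, attempt_proof, refine_proof, lean_check) is a hypothetical placeholder for illustration, not the paper's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Lemma:
    statement: str  # Lean 4 statement of a proved conjecture
    proof: str      # Lean 4 proof script that compiled successfully


@dataclass
class LemmaPool:
    lemmas: list[Lemma] = field(default_factory=list)

    def add(self, lemma: Lemma) -> None:
        self.lemmas.append(lemma)

    def context(self) -> str:
        # Verified lemmas are handed back to the prover as reusable facts.
        return "\n".join(l.statement for l in self.lemmas)


# --- Hypothetical placeholders (not the paper's API) ------------------------

def propose_conjectures(problem: str, context: str) -> list[str]:
    """LLM 'Proposer': return candidate conjectures as Lean 4 statements."""
    raise NotImplementedError


def attempt_proof(statement: str, context: str) -> str:
    """Whole-proof LLM prover: return a Lean 4 proof attempt."""
    raise NotImplementedError


def refine_proof(statement: str, proof: str, feedback: str, context: str) -> str:
    """Rewrite a failed proof, conditioning on the compiler's error messages."""
    raise NotImplementedError


def lean_check(statement: str, proof: str) -> tuple[bool, str]:
    """Compile the proof with Lean 4; return (success, error feedback)."""
    raise NotImplementedError


def solve(problem: str, rounds: int = 3, refinements: int = 3) -> str | None:
    pool = LemmaPool()
    for _ in range(rounds):
        # Broad reasoning: conjecture many properties of the problem
        # (e.g. "f is injective") and keep whichever ones can be proved.
        for conjecture in propose_conjectures(problem, pool.context()):
            proof = attempt_proof(conjecture, pool.context())
            ok, _ = lean_check(conjecture, proof)
            if ok:
                pool.add(Lemma(conjecture, proof))

        # Deep reasoning: attack the main theorem, iteratively refining the
        # whole proof against compiler feedback while reusing the lemma pool.
        proof = attempt_proof(problem, pool.context())
        for _ in range(refinements):
            ok, feedback = lean_check(problem, proof)
            if ok:
                return proof
            proof = refine_proof(problem, proof, feedback, pool.context())
    return None
```

The key property this sketch tries to mirror is that each lemma is checked by the compiler independently, so partial progress survives even when an attempt at the main proof fails.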
Evaluation Highlights
  • Proved 5 of the 6 IMO 2025 problems (within a post-contest time window), with the geometry problem solved in under 2 seconds
  • Achieved a 78.1% success rate on 155 past IMO problems (2000-2024), establishing a new state of the art
  • Saturated the MiniF2F-test benchmark with 99.6% accuracy under the medium inference setting
Breakthrough Assessment
9/10
Achieves near-perfect scores on standard benchmarks (MiniF2F) and solves 5 of 6 IMO 2025 problems, demonstrating formal-math capability at the level of a human silver medalist.