Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving

📝 Paper Summary

Automated Theorem Proving (ATP), Formal Reasoning, Autoformalization

Goedel-Prover achieves state-of-the-art automated theorem proving by training on a massive dataset of 1.64 million autoformalized statements and iteratively refining itself through expert iteration.

Core Problem

Training LLMs for formal theorem proving is bottlenecked by the extreme scarcity of high-quality formal mathematical statements and proofs compared to informal natural language math data.

Why it matters:

Formal proofs allow machine verification of reasoning, unlike informal natural language reasoning which is error-prone and hard to verify
Existing formal datasets like Lean Workbook are small (only 15.7K proofs), limiting the ability of models to learn complex proof strategies
Manual formalization requires significant domain expertise and is not scalable

Concrete Example: Previous open-source models like DeepSeek-Prover-V1.5 rely on reinforcement learning on proprietary datasets or small public sets, reaching 50.0% Pass@32 on miniF2F. Goedel-Prover scales the available data by autoformalizing 860K Numina problems, which lets purely supervised training reach 57.6%.

Key Novelty

Scale-driven Expert Iteration with Autoformalized Data

Creates a large formal dataset (Goedel-Pset-v1) by training two distinct formalizers to translate informal math problems into 1.64 million Lean 4 statements
Uses 'Whole-Proof' expert iteration: the model generates complete proofs without intermediate compiler interaction, verifies them, and retrains on the successful proofs in multiple rounds
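
The loop described above can be sketched in a few lines of Python. This is a toy illustration only, not the authors' implementation: `ToyProver`, its `skill` field, and `toy_verify` are hypothetical stand-ins for the language model, its weights, and the Lean compiler. The point it shows is the control flow: generate whole proofs, keep only the verified ones, retrain, repeat.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyProver:
    """Hypothetical stand-in for the prover model; `skill` plays the role of weights."""
    skill: int = 0

    def generate(self, difficulty: int, k: int) -> List[str]:
        # Emit k whole-proof candidates with no verifier feedback during generation.
        # In this toy, a candidate is correct only if the statement is within reach.
        within_reach = difficulty <= self.skill + 1
        candidate = f"proof({difficulty})" if within_reach else "sorry"
        return [candidate] * k

    def retrain(self, pairs: List[Tuple[int, str]]) -> None:
        # Stand-in for supervised fine-tuning on verified (statement, proof) pairs.
        self.skill = max((d for d, _ in pairs), default=self.skill)


def toy_verify(difficulty: int, proof: str) -> bool:
    # Stand-in for the Lean compiler accepting or rejecting a proof.
    return proof == f"proof({difficulty})"


def expert_iteration(prover: ToyProver, statements: List[int],
                     rounds: int, k: int = 4) -> Dict[int, str]:
    solved: Dict[int, str] = {}
    for _ in range(rounds):
        for stmt in statements:
            if stmt in solved:
                continue  # one verified proof per statement is enough
            for cand in prover.generate(stmt, k):
                if toy_verify(stmt, cand):
                    solved[stmt] = cand
                    break
        # Retrain on everything verified so far, then start the next round.
        prover.retrain(list(solved.items()))
    return solved


if __name__ == "__main__":
    print(sorted(expert_iteration(ToyProver(), [1, 2, 3, 5], rounds=3)))
```

In this toy, each round unlocks slightly harder statements because the retrained prover has seen more verified proofs, which is the intuition behind running several rounds. The statement with difficulty 5 stays unsolved because nothing of difficulty 4 exists to bridge the gap.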

Architecture

The Expert Iteration pipeline: Iterative loop of proof generation, verification, and retraining.

Evaluation Highlights

57.6% Pass@32 on miniF2F benchmark with SFT only, surpassing previous SOTA DeepSeek-Prover-V1.5-RL (50.0%) by 7.6 percentage points
Solves 7 problems on PutnamBench (Pass@512), ranking #1 on the leaderboard
Discovered 29.7K valid proofs for Lean Workbook problems, nearly doubling the 15.7K proofs found by prior provers

Breakthrough Assessment

9/10

Sets a new open-source SOTA on miniF2F and PutnamBench through a scalable data synthesis recipe. The release of 1.6M formal statements and ~30K new proofs is a major resource contribution.

⚙️ Technical Details

Problem Definition

Setting: Automated Theorem Proving in Lean 4

Inputs: Formal mathematical statement in Lean 4 (and optionally informal statement context)

Outputs: Complete formal proof script in Lean 4 that compiles successfully
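
To make the input/output contract concrete, here is a small illustrative Lean 4 example. It is not taken from the paper or its datasets; the theorem names are made up for illustration. The first declaration is what a prover would receive (a statement with the proof left open), and the second is the kind of complete proof it is expected to return.

```lean
-- Input: a formal statement whose proof is still missing.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  sorry

-- Output: the same statement with a complete proof that the Lean compiler accepts.
theorem add_comm_example' (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

A proof counts as successful only if it compiles without `sorry`; the first declaration compiles with a warning and would not count.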

Pipeline Flow

Formalizer A/B (translate informal problems to Lean statements)
Formal Statement Verification (compiler check + faithfulness check)
Prover (generates whole proofs for statements)
Proof Verification (Lean compiler checks proof correctness)
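
The statement-filtering half of this flow can be sketched as follows. This is a minimal sketch under stated assumptions: `Candidate`, `filter_statements`, `toy_compiles`, and `toy_is_faithful` are hypothetical names, and the two toy checks are crude string tests standing in for the real compiler check and faithfulness check.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    informal: str    # natural-language problem
    formal: str      # Lean 4 statement proposed by a formalizer
    formalizer: str  # "A" or "B"


def filter_statements(
    candidates: List[Candidate],
    compiles: Callable[[str], bool],
    is_faithful: Callable[[str, str], bool],
) -> List[Candidate]:
    """Keep only candidates that pass both checks, in pipeline order."""
    kept = []
    for cand in candidates:
        if not compiles(cand.formal):
            continue  # rejected by the compiler check
        if not is_faithful(cand.informal, cand.formal):
            continue  # compiles, but does not match the informal problem
        kept.append(cand)
    return kept


# Toy stand-ins (assumptions for illustration, not the real checks).
def toy_compiles(formal: str) -> bool:
    return formal.startswith("theorem ") and ":=" in formal


def toy_is_faithful(informal: str, formal: str) -> bool:
    # A formalization that only states `True` has lost the problem's content.
    return ": True :=" not in formal


CANDIDATES = [
    Candidate("Show that a + b = b + a.",
              "theorem t1 (a b : Nat) : a + b = b + a := by sorry", "A"),
    Candidate("Show that a + b = b + a.",
              "theorem t1 (a b : Nat) : True := by sorry", "B"),
    Candidate("Show that 2 + 2 = 4.", "lemma? 2 + 2 = 4", "A"),
]

if __name__ == "__main__":
    for cand in filter_statements(CANDIDATES, toy_compiles, toy_is_faithful):
        print(cand.formalizer, cand.formal)
```

Running the compiler check first is a natural ordering because it is mechanical and rejects malformed output before the more expensive faithfulness judgment is made.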

System Modules

Formalizer A (Data Generation)

Translate informal math to formal Lean statements (Style A)

Model or implementation: Qwen2.5-Coder-32B

Formalizer B (Data Generation)

Translate informal math to formal Lean statements (Style B)

Model or implementation: Qwen2.5-Coder-32B

Prover

Generate complete proofs for formal statements

Model or implementation: DeepSeek-Prover-V1.5-Base (iteratively updated)

Modeling

Base Model: DeepSeek-Prover-V1.5-Base

Training Method: Expert Iteration (Iterative SFT on self-generated correct proofs)

Objective Functions:

Purpose: Maximize likelihood of correct proofs.

Formally: Standard cross-entropy loss on the proof tokens.
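
Written out, a standard token-level cross-entropy objective of this kind takes the following form. The notation is an assumption introduced here for illustration: $\mathcal{D}$ is the set of verified (statement, proof) pairs, $s$ is a formal statement, $p = (p_1, \dots, p_{|p|})$ is its proof as a token sequence, and $\pi_\theta$ is the prover with parameters $\theta$.

```latex
\mathcal{L}(\theta)
  = -\sum_{(s,\,p) \in \mathcal{D}} \; \sum_{t=1}^{|p|}
    \log \pi_\theta\!\left(p_t \mid s,\, p_{<t}\right)
```

The inner sum runs over proof tokens only, so the statement $s$ conditions the prediction but contributes no loss terms of its own.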

Adaptation: Full fine-tuning

Training Data:

Goedel-Pset-v1: 1.64 million formal statements autoformalized from Numina and AOPS
Proofs generated by iteratively retrained provers (8 iterations)

Key Hyperparameters:

learning_rate: 1e-4 or 5e-5
batch_size: 8 (with packing)
epochs: 1 or 2 per iteration
iterations: 8

Compute: Training: ~12 hours on 4 H100 GPUs per epoch/iteration. Inference: 6 hours on 64 H100 GPUs for Pass@16 on 1.78M statements. Verification: 10 hours on 8,000 CPUs.
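
As a back-of-the-envelope reading of these figures, the short script below converts them into candidate-proof counts and device-hours. It only multiplies the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the compute summary above.
STATEMENTS = 1_780_000          # statements sampled at Pass@16
SAMPLES_PER_STATEMENT = 16

candidate_proofs = STATEMENTS * SAMPLES_PER_STATEMENT  # proofs to generate and verify
inference_gpu_hours = 6 * 64                           # 6 hours on 64 H100 GPUs
training_gpu_hours = 12 * 4                            # ~12 hours on 4 H100 GPUs
verification_cpu_hours = 10 * 8_000                    # 10 hours on 8,000 CPUs

if __name__ == "__main__":
    print(f"candidate proofs:       {candidate_proofs:,}")
    print(f"inference GPU-hours:    {inference_gpu_hours:,}")
    print(f"training GPU-hours:     {training_gpu_hours:,}")
    print(f"verification CPU-hours: {verification_cpu_hours:,}")
```

On these numbers, generating and verifying candidate proofs accounts for far more device-hours than a single training epoch.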

Comparison to Prior Work

vs. DeepSeek-Prover-V1.5-RL: Achieves higher performance using only open-source data and SFT (vs proprietary data + RL)
vs. InternLM2.5-StepProver: Generates whole proofs instead of step-by-step search; scales data via massive autoformalization of Numina
vs. ABEL: Solves same number of PutnamBench problems (7) but with lower inference budget (Pass@512 vs Pass@596)

Limitations

Reliance on informal-to-formal translation may introduce unfaithfulness in problem statements (mitigated but not eliminated by the Faithfulness and Completeness (FC) test)
Whole-proof generation does not utilize intermediate compiler feedback during generation (unlike step-wise provers)
RL/DPO extensions are prone to overfitting to 'shortcuts' and benefit less from inference-time compute scaling

Reproducibility

Code: publicly available at https://github.com/Goedel-LM/Goedel-Prover

Released: Code, SFT model, DPO model, Formalizer models, Goedel-Pset-v1 dataset, Goedel-Pset-v1-solved dataset, and 29.7K new Lean Workbook proofs.

📊 Experiments & Results

Evaluation Setup

Formal theorem proving in Lean 4

Benchmarks:

miniF2F: high-school competition math (Lean 4)
PutnamBench: undergraduate competition math (Lean 4)
Lean Workbook: large-scale math problems (Lean 4)

Metrics:

Pass@k (percentage of problems solved with k samples)
Cumulative number of problems solved
Statistical methodology: Not explicitly reported in the paper
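
The Pass@k metric listed above can be computed directly from per-sample verification outcomes. The sketch below uses the plain empirical definition (a problem counts as solved if any of its first k samples verifies); the function name and the toy outcomes are illustrative assumptions, not data from the paper.

```python
from typing import Dict, List


def pass_at_k(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Percentage of problems with at least one verified proof among the first k samples."""
    if not outcomes:
        raise ValueError("need at least one problem")
    if k < 1:
        raise ValueError("k must be positive")
    solved = sum(any(flags[:k]) for flags in outcomes.values())
    return 100.0 * solved / len(outcomes)


# Toy outcomes: True means the Lean compiler accepted that sample.
TOY_OUTCOMES = {
    "p1": [False, True],
    "p2": [False, False],
    "p3": [True, False],
    "p4": [False, False],
}

if __name__ == "__main__":
    print(pass_at_k(TOY_OUTCOMES, 1))  # only p3 is solved by its first sample
    print(pass_at_k(TOY_OUTCOMES, 2))  # p1 and p3 are solved within two samples
```

Because a larger k can only add solved problems, Pass@k is non-decreasing in k, which is why results at different sample budgets (for example Pass@32 and Pass@3200) are not directly comparable.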

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
miniF2F	Pass@32 (%)	50.0	57.6	+7.6
miniF2F	Pass@3200 (%)	54.9	62.7	+7.8
PutnamBench	Solved Problems (Pass@512)	6	7	+1
Lean Workbook	Total Proven Statements	15.7K	29.7K	+14.0K

Experiment Figures

Left: miniF2F Pass@32 comparison. Middle: miniF2F performance scaling with sample budget (Pass@K). Right: Cumulative solved problems in Lean Workbook.

Performance improvement on miniF2F and NuminaTest across the 8 iterations of expert iteration.

Main Takeaways

Scaling formal statements via autoformalization + expert iteration is highly effective, surpassing previous methods that relied on RL with smaller or proprietary datasets.
Two distinct formalizers (different training data sources) increase statement diversity, which benefits the prover's performance.
Pure SFT on high-quality iterative data outperforms RL baselines. Adding RL/DPO yields further gains (reaching >60% on miniF2F), but the resulting models benefit less from additional inference-time compute.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Lean 4 proof assistant
Understanding of Large Language Models (LLMs) and Supervised Fine-Tuning (SFT)
Basic knowledge of Expert Iteration / Reinforcement Learning

Key Terms

Lean 4: A functional programming language and interactive theorem prover used to formalize mathematics

Autoformalization: The process of using LLMs to translate informal natural language math problems into formal code (e.g., Lean)

Expert Iteration: A training method where a model generates solutions, verifies them (e.g., via compiler), and retrains on the correct solutions to improve itself

SFT: Supervised Fine-Tuning—training a model on a labeled dataset of correct inputs and outputs

Pass@k: A metric measuring the percentage of problems solved given k attempts (samples) per problem

Whole-proof generation: Generating the entire proof script in one go, rather than interacting with the proof assistant step-by-step

DPO: Direct Preference Optimization—an algorithm for aligning models to preferences without a separate reward model

RL: Reinforcement Learning—training agents to take actions that maximize a cumulative reward