Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

📝 Paper Summary

Neural Theorem Proving · Formal Mathematics (Lean 4) · Reinforcement Learning for Reasoning

Kimina-Prover adapts a large-scale reinforcement learning pipeline to formal mathematics, using a structured reasoning pattern to enable internal exploration and achieving state-of-the-art Lean 4 proof generation without external search.

Core Problem

Existing neural theorem provers rely on computationally expensive external search algorithms (BFS, MCTS) and fail to capture deep structured reasoning, resulting in poor scaling with model size.

Why it matters:

External search methods like BFS/MCTS introduce significant computational overhead and complexity, limiting scalability
Standard supervised fine-tuning lacks the ability to elicit the non-linear, sophisticated reasoning required for complex formal proofs
Prior formal math models have not shown clear performance improvements when scaling model size, unlike informal reasoning models

Concrete Example: General reasoning models like OpenAI's o3 achieve 0% on the formal IMO subset of miniF2F because they default to informal, unverifiable answers. In contrast, Kimina-Prover generates valid Lean 4 code by internally verifying and refining steps, achieving 40% on the same subset.

Key Novelty

Reasoning-Driven Exploration with Formal Reasoning Patterns

Replaces external tree search (BFS/MCTS) with internal 'reasoning-driven exploration' where the model explores the proof space via long chain-of-thought tokens
Enforces a specific 'formal reasoning pattern' that interleaves informal mathematical thought with formal Lean code blocks, aligning intuition with verification
Applies the Kimi k1.5 reinforcement learning pipeline to formal math, utilizing a large-scale autoformalized dataset and binary correctness rewards
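
To make the "formal reasoning pattern" concrete, here is a minimal Lean 4 sketch of informal thought sitting next to the formal step it motivates. It is an illustration only and is not taken from the paper: the theorem name `sketch_two_mul_le_sq` is hypothetical, the comments stand in for the model's informal reasoning, and the example assumes Mathlib is available.

```lean
import Mathlib

-- Informal thought: (a - b)^2 is nonnegative, and expanding it gives
-- a^2 - 2ab + b^2 >= 0, which rearranges to the goal.
theorem sketch_two_mul_le_sq (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  -- Formal step: hand the nonnegative square to nlinarith as a hint.
  nlinarith [sq_nonneg (a - b)]
```

In the pattern described above, many such thought-plus-snippet pairs would appear inside the model's reasoning before it commits to a final proof.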

Architecture

The Reinforcement Learning pipeline and the 'Formal Reasoning Pattern'.

Evaluation Highlights

Achieves 80.7% accuracy (pass@8192) on miniF2F-test, setting a new state-of-the-art and surpassing the previous best (BFS Prover) of 72.95%
Demonstrates high sample efficiency with 52.94% pass@1 and 68.85% pass@32 on miniF2F, outperforming many search-based baselines requiring thousands of samples
Shows clear performance scaling with model size (1.5B → 7B → 72B), with the 72B model outperforming the 7B version by 7.87 percentage points at pass@8192

Breakthrough Assessment

9/10

First system to demonstrate clear performance scaling with model size for formal theorem proving and to exceed 80% on miniF2F without external search algorithms. Effectively bridges informal and formal reasoning via RL.

⚙️ Technical Details

Problem Definition

Setting: Automatic Theorem Proving in Lean 4

Inputs: Formal statement of a theorem in Lean 4 (and optional natural language context)

Outputs: A valid, compilable Lean 4 proof closing the theorem

Pipeline Flow

Problem Input (Lean Statement)
Kimina-Prover Generation (Reasoning + Proof)
Lean Server Verification
Output Selection
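
The four steps above can be sketched as a sample-verify-select loop. This is a minimal sketch, not the authors' implementation: `prove`, `make_toy_generator`, and `toy_verify` are hypothetical names, the generator and verifier are injected as plain callables, and the toy verifier only stands in for a real Lean server check.

```python
from typing import Callable, List, Optional


def prove(
    statement: str,
    generate: Callable[[str], str],
    verify: Callable[[str], bool],
    num_samples: int = 8,
) -> Optional[str]:
    """Sample up to num_samples candidates; return the first that verifies."""
    for _ in range(num_samples):
        candidate = generate(statement)  # model output: reasoning + proof
        if verify(candidate):            # stand-in for Lean server verification
            return candidate             # output selection: first verified proof
    return None


def make_toy_generator(outputs: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for the model: replays canned outputs in order."""
    it = iter(outputs)
    return lambda statement: next(it)


def toy_verify(proof: str) -> bool:
    """Hypothetical stand-in for Lean: rejects proofs that contain `sorry`."""
    return "sorry" not in proof
```

With a real prover, `generate` would call the model and `verify` would submit the code to a Lean server; the loop is the only search that happens outside the model.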

System Modules

Kimina-Prover

Generate the full proof, including internal reasoning traces

Model or implementation: Fine-tuned Qwen2.5-72B (or distilled 1.5B/7B)

Lean Server

Verify the generated Lean 4 code for correctness

Model or implementation: Numina Lean Server (based on Lean REPL)

Novel Architectural Elements

Internalized search via 'formal reasoning pattern': The model outputs explicit '<think>' blocks containing interleaved informal reasoning and formal tactic snippets, replacing external search trees with linear token generation.
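
Because the reasoning block may itself contain tactic snippets, a consumer of the output has to separate the final proof from the exploratory code. The sketch below shows one way to do that under stated assumptions: the closing tag `</think>` and the `lean4` fence label are guesses about the output format, and `extract_final_proof` is a hypothetical helper, not part of any released tooling.

```python
import re
from typing import Optional

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable

# Matches a fenced block labelled lean or lean4 and captures its body.
LEAN_BLOCK = re.compile(FENCE + r"lean4?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_final_proof(output: str) -> Optional[str]:
    """Return the last Lean block that appears after the reasoning section."""
    # Text after the last closing think tag; the whole output if no tag exists.
    answer = output.rpartition("</think>")[2]
    blocks = LEAN_BLOCK.findall(answer)
    return blocks[-1].strip() if blocks else None
```

Snippets inside the reasoning section are ignored, so only the code the model finally commits to is sent for verification.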

Modeling

Base Model: Qwen2.5-72B

Training Method: Reinforcement Learning (Kimi k1.5 pipeline)

Objective Functions:

Purpose: Optimize policy to maximize expected reward while staying close to reference model.

Formally: The policy is trained to maximize expected reward (proof correctness) while a KL-divergence term, weighted by the coefficient tau = 0.4, penalizes drift from the reference model.
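
One common way to write such an objective is the KL-regularized form below. This is a sketch under assumptions rather than the paper's exact loss: the symbols $\mathcal{D}$ (problem set), $\pi_\theta$ (policy), $\pi_{\mathrm{ref}}$ (reference model), and $r$ (binary correctness reward) are introduced here for illustration.

```latex
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\; \tau \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big],
\qquad \tau = 0.4, \quad r(x, y) \in \{0, 1\}
```

A larger $\tau$ keeps the policy closer to the reference model; a smaller one lets the reward dominate.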

Adaptation: Full model training

Training Data:

Base problem set created via autoformalization of natural language problems using 'Kimina-Autoformalizer-7B'
SFT Data: 20K Claude-synthesized 'thinking' examples + informal math thinking data

Key Hyperparameters:

learning_rate: 2e-6
kl_coefficient_tau: 0.4
batch_size_N: 1000 problems per iteration
rollouts_per_problem_k: 8
negative_gradient_discard_prob: 0.5
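
The sketch below shows how the last three hyperparameters might interact in one iteration. It rests on an assumed reading of `negative_gradient_discard_prob`: each failed rollout is independently dropped from the update with that probability. The paper may apply the rule differently, and `Rollout` and `select_for_update` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

PROBLEMS_PER_ITERATION = 1000  # batch_size_N
ROLLOUTS_PER_PROBLEM = 8       # rollouts_per_problem_k
ROLLOUTS_PER_ITERATION = PROBLEMS_PER_ITERATION * ROLLOUTS_PER_PROBLEM


@dataclass
class Rollout:
    problem_id: int
    proof: str
    reward: int  # binary correctness reward: 1 if Lean accepts, else 0


def select_for_update(
    rollouts: List[Rollout],
    discard_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[Rollout]:
    """Keep every successful rollout; drop each failure with discard_prob."""
    rng = rng or random.Random(0)
    kept = []
    for r in rollouts:
        # Assumption: failures are discarded independently, one draw each.
        if r.reward == 1 or rng.random() >= discard_prob:
            kept.append(r)
    return kept
```

Under these settings each iteration produces 8,000 rollouts, all of which must be checked by Lean, which is consistent with verification being a CPU-heavy part of training.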

Compute: 640 CPU cores used for verification during training (GPU usage for LLM training not explicitly reported)

Comparison to Prior Work

vs. BFS Prover: Replaces explicit BFS tree search with implicit, internal chain-of-thought exploration enabled by RL
vs. DeepSeek-Prover-V1.5: Does not use MCTS; relies on single-turn generation with long context reasoning
vs. o3: Generates verifiable formal code rather than unverifiable informal math; o3 scores 0% on the formal IMO subset of miniF2F
vs. Lean-STaR [not cited in paper]: Lean-STaR uses retrospective traces for training; Kimina-Prover focuses on large-scale RL with a specific interleaved thinking pattern and emphasizes scaling with model size

Limitations

RL training is volatile; accuracy can regress during training (mid-phase instability observed)
Requires high-quality synthetic data for 'cold start' SFT (only Claude 3.7 Sonnet produced satisfactory results)
Performance on very hard problems (IMO-level) is still limited compared to human experts (40% on IMO subset)

Reproducibility

Code: https://github.com/MoonshotAI/Kimina-Prover-Preview

Publicly available: Distilled 1.5B and 7B model weights, corrected miniF2F dataset, and inference code. The 'Kimina-Autoformalizer-7B' is mentioned as open source.

Missing: The main 72B model weights are not explicitly linked as open source in the abstract or introduction (only the distilled versions are mentioned). Training compute (GPUs) is not specified.

📊 Experiments & Results

Evaluation Setup

Formal theorem proving in Lean 4 environment

Benchmarks:

miniF2F: formal math proof generation (high school and olympiad level)

Metrics:

Pass@k (Accuracy with k samples)
Pass@1
Statistical methodology: Not explicitly reported in the paper
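
Since the estimation procedure is not reported, the sketch below shows one standard way to compute pass@k from n samples of which c are correct: the unbiased estimator 1 - C(n - c, k) / C(n, k). Whether the paper uses this estimator or simply checks whether any of k samples succeeds is not stated; `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples with c correct ones."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    p_all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - p_all_fail
```

The benchmark score is then the mean of this value over all problems. When k equals n, the estimator reduces to checking whether at least one sample was correct.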

Key Results

All values are percentages; Δ is in percentage points.

Comparison	Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on miniF2F (state-of-the-art performance)	miniF2F-test	Pass@8192	72.95	80.74	+7.79
Main comparison on miniF2F (state-of-the-art performance)	miniF2F-test	Pass@1	50.4	52.94	+2.54
Scaling analysis (gains with increased model size)	miniF2F-test	Pass@8192	72.87	80.74	+7.87
General-purpose reasoning models on specific subsets	miniF2F-test (IMO Subset)	Pass@32	5.00	20.00	+15.00
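
The Δ column is the plain difference between the two scores, in percentage points. The short sketch below recomputes it from the reported values; it uses `Decimal` to avoid floating-point rounding, and `ROWS` and `delta` are illustrative names.

```python
from decimal import Decimal

# (benchmark, metric, baseline, this paper), values copied from the table.
ROWS = [
    ("miniF2F-test", "Pass@8192", "72.95", "80.74"),
    ("miniF2F-test", "Pass@1", "50.4", "52.94"),
    ("miniF2F-test", "Pass@8192", "72.87", "80.74"),
    ("miniF2F-test (IMO Subset)", "Pass@32", "5.00", "20.00"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Difference in percentage points between two reported scores."""
    return Decimal(ours) - Decimal(baseline)


DELTAS = [delta(baseline, ours) for _, _, baseline, ours in ROWS]
```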

Experiment Figures

Evolution of Pass@32 accuracy and average output token length during RL training.

Main Takeaways

Formal reasoning performance scales with model size (1.5B to 72B), a trend not previously established in neural theorem proving
Implicit 'reasoning-driven exploration' via RL is more sample-efficient and effective than explicit search algorithms (BFS/MCTS) for this task
General-purpose reasoning models (o3, Gemini) struggle with formal verification despite strong informal math skills, highlighting the need for domain-specific formal training
RL training dynamics in formal math are more volatile than in informal math, likely due to the strictness of the formal environment

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Lean 4 and interactive theorem proving
Understanding of Large Language Models (LLMs) and Chain-of-Thought (CoT) reasoning
Basics of Reinforcement Learning (RL) and policy gradients

Key Terms

Lean 4: A functional programming language and interactive theorem prover used for formalizing mathematics

tactic: A command in Lean that transforms the current proof state (goal) into simpler subgoals

pass@k: An evaluation metric measuring the probability that at least one correct solution is generated out of k samples

autoformalization: The process of automatically translating natural language mathematics into formal code (e.g., Lean)

SFT: Supervised Fine-Tuning—training a model on labeled examples before applying reinforcement learning

RL: Reinforcement Learning—training an agent (here the LLM) to maximize a reward signal (proof correctness)

BFS: Best-First Search—a search algorithm that explores a graph by expanding the most promising nodes first
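
For contrast with the internal exploration used by Kimina-Prover, here is a minimal generic sketch of best-first search using a priority queue. It is not the BFS Prover implementation: states are arbitrary hashable values, and `is_goal`, `expand`, and `score` are caller-supplied placeholders (in a prover they would correspond to a closed proof state, tactic application, and a learned value estimate).

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Optional


def best_first_search(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    expand: Callable[[Hashable], Iterable[Hashable]],
    score: Callable[[Hashable], float],
) -> Optional[List[Hashable]]:
    """Expand the highest-scoring frontier state first; return a path or None."""
    counter = 0  # tie-breaker so the heap never has to compare states
    frontier = [(-score(start), counter, start, [start])]
    seen = {start}
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                counter += 1
                heapq.heappush(frontier, (-score(nxt), counter, nxt, path + [nxt]))
    return None
```

Every expansion here would require a call to the proof environment, which is the overhead that search-free generation avoids.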

MCTS: Monte Carlo Tree Search—a heuristic search algorithm for decision processes, often used in game playing

CoT: Chain-of-Thought—a prompting technique encouraging models to generate intermediate reasoning steps

KL divergence: A measure of how one probability distribution differs from a second, reference probability distribution
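
For discrete distributions the definition is KL(p || q) = sum over i of p_i * log(p_i / q_i). The sketch below computes it for two probability vectors; it is a toy illustration over small lists, not how the quantity is evaluated over token sequences during training, and `kl_divergence` is a hypothetical helper name.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total
```

The value is zero when the two distributions are identical and grows as the policy moves away from the reference, which is why it serves as a penalty term.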