
Llemma: An Open Language Model For Mathematics

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, S. McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, S. Welleck
Princeton University, EleutherAI
International Conference on Learning Representations (ICLR), 2024
Topics: Pretraining · Reasoning · Agent · Benchmark

📝 Paper Summary

Mathematical Reasoning Domain Adaptation for LLMs
Llemma adapts Code Llama to mathematics via continued pretraining on a 55B-token mixture of scientific papers, web math, and mathematical code, achieving state-of-the-art performance among open models.
Core Problem
Generalist language models often struggle with deep specialized domains like mathematics, while existing domain-specific models are either closed-source (e.g., Minerva) or lag significantly behind in capability.
Why it matters:
  • Closed-access models limit the research community's ability to study mathematical reasoning, reward modeling, and reinforcement learning for reasoning
  • Solving math problems requires pattern matching against specialized prior knowledge not sufficiently represented in general pretraining corpora
  • Strong mathematical reasoning capabilities are upstream of critical research topics like algorithmic reasoning and formal verification
Concrete Example: When solving a formal theorem proving task in Lean 4, a standard Code Llama model may struggle to generate correct tactics due to insufficient exposure to proof states, whereas Llemma, trained on the AlgebraicStack, can successfully predict valid proof steps.
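To make the tactic-prediction setting concrete, here is a toy Lean 4 goal of the kind such a model completes. The theorem and tactic below are illustrative, not taken from the paper's evaluation set:

```lean
-- Goal state shown to the model:  a b : Nat ⊢ a + b = b + a
-- The model must emit a tactic (or term) that closes the goal.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

A model with little exposure to Lean proof states tends to hallucinate nonexistent lemma names here, whereas one trained on formal-proof corpora like the AlgebraicStack has seen many goal-to-tactic pairs of exactly this shape.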
Key Novelty
Llemma (Math-Adapted Code Llama)
  • Continues pretraining Code Llama on Proof-Pile-2, a curated 55B-token dataset mixing scientific papers, web math (OpenWebMath), and mathematical code (AlgebraicStack)
  • Leverages the synergy between code and mathematics by initializing from a strong code model rather than a general text model
  • Integrates computational tools and formal languages directly into the pretraining distribution via the AlgebraicStack dataset
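The continued-pretraining mixture can be sketched as size-proportional sampling over the three Proof-Pile-2 sources. The token counts below are rough figures (the paper reports a 55B-token total); treat the weights and source names as illustrative, not the exact training configuration:

```python
import random

# Approximate Proof-Pile-2 composition. Token counts are rough,
# illustrative figures summing to the paper's stated 55B total.
SOURCES = {
    "arxiv": 29e9,           # scientific papers
    "open-web-math": 15e9,   # web math text (OpenWebMath)
    "algebraic-stack": 11e9, # mathematical code and formal proofs
}

def sample_source(rng: random.Random) -> str:
    """Pick a source with probability proportional to its token count,
    i.e. sample documents in proportion to the mixture's sizes."""
    total = sum(SOURCES.values())
    r = rng.random() * total
    for name, tokens in SOURCES.items():
        r -= tokens
        if r <= 0:
            return name
    return name  # floating-point edge case: return the last source

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
print({s: draws.count(s) / len(draws) for s in SOURCES})
```

With enough draws, the empirical fractions converge to the mixture weights (~53% / 27% / 20% under the counts above); real training pipelines implement the same idea at the level of shuffled, pre-tokenized shards.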
Evaluation Highlights
  • Llemma-34B outperforms Code Llama-34B by +20 percentage points on GSM8k and +13 points on MATH
  • Llemma-7B outperforms the proprietary Minerva-8B model on the MATH benchmark at a comparable parameter count
  • Llemma-7B proves 26.23% of miniF2F-test theorems (formal theorem proving), surpassing its Code Llama initialization (20.49%)
Breakthrough Assessment
9/10
Establishes a new open SOTA for mathematics, releasing not just models but the critical training datasets (Proof-Pile-2) that enable replication and further research.