
Llemma: An Open Language Model For Mathematics

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, S. McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, S. Welleck
Princeton University, EleutherAI
International Conference on Learning Representations (ICLR), 2024
Topics: Pretraining · Reasoning · Agent · Benchmark

📝 Paper Summary

Mathematical Reasoning Domain Adaptation for LLMs
Llemma adapts Code Llama to mathematics via continued pretraining on a 55B-token mixture of scientific papers, web math, and mathematical code, achieving state-of-the-art performance among open models.
Core Problem
Generalist language models often struggle with deep specialized domains like mathematics, while existing domain-specific models are either closed-source (e.g., Minerva) or lag significantly behind in capability.
Why it matters:
  • Closed-access models limit the research community's ability to study mathematical reasoning, reward modeling, and reinforcement learning for reasoning
  • Solving math problems requires pattern matching against specialized prior knowledge not sufficiently represented in general pretraining corpora
  • Strong mathematical reasoning capabilities are upstream of critical research topics like algorithmic reasoning and formal verification
Concrete Example: When solving a formal theorem proving task in Lean 4, a standard Code Llama model may struggle to generate correct tactics due to insufficient exposure to proof states, whereas Llemma, trained on the AlgebraicStack, can successfully predict valid proof steps.
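To make the tactic-prediction setting concrete, here is a toy Lean 4 goal of the kind such a model completes. The theorem and tactic below are illustrative, not taken from the paper's evaluation set:

```lean
-- Goal state shown to the model:  a b : Nat ⊢ a + b = b + a
-- The model must emit a tactic (or term) that closes the goal.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

A model with little exposure to Lean proof states tends to hallucinate nonexistent lemma names here, whereas one trained on formal-proof corpora like the AlgebraicStack has seen many goal-to-tactic pairs of exactly this shape.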
Key Novelty
Llemma (Math-Adapted Code Llama)
  • Continues pretraining Code Llama on Proof-Pile-2, a curated 55B-token dataset mixing scientific papers, web math (OpenWebMath), and mathematical code (AlgebraicStack)
  • Leverages the synergy between code and mathematics by initializing from a strong code model rather than a general text model
  • Integrates computational tools and formal languages directly into the pretraining distribution via the AlgebraicStack dataset
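The continued-pretraining mixture can be sketched as size-proportional sampling over the three Proof-Pile-2 sources. The token counts below are rough figures (the paper reports a 55B-token total); treat the weights and source names as illustrative, not the exact training configuration:

```python
import random

# Approximate Proof-Pile-2 composition. Token counts are rough,
# illustrative figures summing to the paper's stated 55B total.
SOURCES = {
    "arxiv": 29e9,           # scientific papers
    "open-web-math": 15e9,   # web math text (OpenWebMath)
    "algebraic-stack": 11e9, # mathematical code and formal proofs
}

def sample_source(rng: random.Random) -> str:
    """Pick a source with probability proportional to its token count,
    i.e. sample documents in proportion to the mixture's sizes."""
    total = sum(SOURCES.values())
    r = rng.random() * total
    for name, tokens in SOURCES.items():
        r -= tokens
        if r <= 0:
            return name
    return name  # floating-point edge case: return the last source

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
print({s: draws.count(s) / len(draws) for s in SOURCES})
```

With enough draws, the empirical fractions converge to the mixture weights (~53% / 27% / 20% under the counts above); real training pipelines implement the same idea at the level of shuffled, pre-tokenized shards.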
Evaluation Highlights
  • Llemma-34B outperforms Code Llama-34B by +20 percentage points on GSM8k and +13 points on MATH
  • Llemma-7B outperforms the proprietary Minerva-8B model on the MATH benchmark at a comparable parameter count
  • Llemma-7B proves 26.23% of miniF2F-test theorems (formal theorem proving), surpassing its Code Llama initialization (20.49%)
Breakthrough Assessment
9/10
Establishes a new open SOTA for mathematics, releasing not just models but the critical training datasets (Proof-Pile-2) that enable replication and further research.