LIMO: Less is More for Reasoning

📝 Paper Summary

Mathematical Reasoning Data Efficiency in LLMs

LIMO demonstrates that strong mathematical reasoning can be elicited from foundation models using only 800 high-quality, long-chain examples rather than massive datasets, challenging the need for extensive training data.

Core Problem

Current approaches for teaching reasoning to Large Language Models (LLMs) rely on massive datasets (tens to hundreds of thousands of examples), on the assumption that complex reasoning requires extensive supervision.

Why it matters:

Training on massive datasets is computationally expensive and data-inefficient
Large-scale Supervised Fine-Tuning (SFT) often leads to memorization rather than true generalization
It is unclear if models are actually learning to reason or just retrieving memorized solution patterns

Concrete Example: When solving a complex American Invitational Mathematics Examination (AIME) problem, a standard model trained on 100k generic math pairs might apply a shallow heuristic and fail. LIMO, trained on just 800 examples, activates pre-trained knowledge to generate a long, self-verifying chain of thought (e.g., 'Let me check this intermediate step...') to reach the correct solution.

Key Novelty

Less-Is-More Reasoning (LIMO) Hypothesis

Posits that sophisticated reasoning is not 'learned' from scratch but 'elicited' from the pre-trained knowledge base using minimal examples
Identifies two elicitation factors: the model's latent knowledge and the quality of examples acting as 'cognitive templates' to trigger extended inference-time computation
Employs a strict multi-stage filtering pipeline to select only 800 highly difficult yet solvable problems with elaborate, self-verifying reasoning chains

Architecture

The LIMO Data Curation Pipeline

Evaluation Highlights

Achieves 63.3% accuracy on AIME24, surpassing previous fine-tuned models (6.5%) using only 1% of the training data
Scores 95.6% on MATH500, outperforming the baseline of 59.2% by 36.4 percentage points
Demonstrates strong Out-Of-Distribution (OOD) generalization, achieving 45.8% absolute improvement across diverse benchmarks compared to models trained on 100x more data

Breakthrough Assessment

9/10

Challenges fundamental assumptions about the data scale required for reasoning. Achieving SOTA-level performance with only 800 examples suggests a major paradigm shift from knowledge injection to capability elicitation.

⚙️ Technical Details

Problem Definition

Setting: Mathematical reasoning tasks with verifiable answers

Inputs: A mathematical question q

Outputs: A reasoning chain r (consisting of intermediate steps) and a final answer a

Pipeline Flow

Input Question -> LIMO Model (fine-tuned from Qwen2.5-32B-Instruct) -> Long Chain-of-Thought Generation -> Final Answer
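
The last stage of this flow (long chain in, final answer out) can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes the model marks its final answer with a LaTeX \boxed{...} wrapper, a common convention in math benchmarks that this summary does not confirm, and extract_boxed_answer is a hypothetical helper name.

```python
from typing import Optional


def extract_boxed_answer(generation: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in the text, or None.

    Assumption: the final answer is wrapped in \boxed{...}. Braces are
    matched by depth so nested LaTeX such as \frac{1}{2} survives intact.
    """
    marker = "\\boxed{"
    start = generation.rfind(marker)
    if start == -1:
        return None  # no boxed answer anywhere in the chain

    depth = 1
    chars = []
    for ch in generation[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


# Hypothetical model output containing a self-verification step.
chain = (
    "Let me check this intermediate step... 12 * 17 = 204, which matches. "
    "So the answer is \\boxed{204}."
)
print(extract_boxed_answer(chain))
```

Taking the last boxed span rather than the first is a deliberate choice here: a long, self-verifying chain may box a tentative value before revising it.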

System Modules

LIMO Model

Generate extended reasoning chains and answers

Model or implementation: Qwen2.5-32B-Instruct (Fine-tuned)

Novel Architectural Elements

The system relies on 'inference-time computation scaling' facilitated by the specific nature of the 800 training examples, which act as cognitive templates for long-form reasoning

Modeling

Base Model: Qwen2.5-32B-Instruct

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize prediction error on the target tokens.

Formally: Standard causal language modeling loss (implied by SFT context)
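
For reference, the standard causal language modeling loss for SFT can be written as below. The summary only says this loss is implied, so the notation is an assumption: q, r, and a are the question, reasoning chain, and answer defined above, y is the target sequence formed by r followed by a, \mathcal{D} is the 800-example training set, and \theta denotes the model parameters.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{(q,\,r,\,a) \sim \mathcal{D}}
    \left[ \sum_{t=1}^{|y|} \log p_\theta\!\left(y_t \mid q,\, y_{<t}\right) \right],
  \qquad y = (r, a)
```

In this form the loss is computed only over the target tokens y, with the question q serving as conditioning context, which matches the stated purpose of minimizing prediction error on the target tokens.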

Adaptation: Full-parameter fine-tuning

Training Data:

Initial Pool: 2,125 problems filtered from NuminaMath, DeepScaleR, AIME, MATH
Difficulty Filter 1: Remove problems solvable by Qwen2.5-Math-7B (too easy)
Difficulty Filter 2: Keep problems solvable by DeepSeek-R1-Distill-Qwen-32B in 1-3 out of 32 attempts (hard but solvable)
Quality Filter: Score solutions based on length (30%), self-verification (20%), exploratory language (25%), and adaptive granularity (25%)
Final Dataset: Top 800 samples
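
The filtering and ranking steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the Candidate class, its field names, and the per-criterion feature scores in [0, 1] are hypothetical, and each problem is assumed to carry a single candidate solution. Only the two difficulty rules, the four weights, and the top-800 cutoff come from the summary.

```python
from dataclasses import dataclass

# Quality weights as listed in the summary; they sum to 1.0.
QUALITY_WEIGHTS = {
    "length": 0.30,
    "self_verification": 0.20,
    "exploratory_language": 0.25,
    "adaptive_granularity": 0.25,
}


@dataclass
class Candidate:
    """Hypothetical record for one problem and its candidate solution."""
    problem_id: str
    solved_by_small_model: bool  # True if Qwen2.5-Math-7B solves it
    strong_model_successes: int  # correct attempts out of 32 by the strong model
    features: dict               # assumed per-criterion scores in [0, 1]


def passes_difficulty_filters(c: Candidate) -> bool:
    # Filter 1: drop problems the small model already solves (too easy).
    if c.solved_by_small_model:
        return False
    # Filter 2: keep problems solved in 1 to 3 of 32 attempts (hard but solvable).
    return 1 <= c.strong_model_successes <= 3


def quality_score(features: dict) -> float:
    # Weighted sum over the four quality criteria.
    return sum(w * features[name] for name, w in QUALITY_WEIGHTS.items())


def curate(candidates, k=800):
    kept = [c for c in candidates if passes_difficulty_filters(c)]
    kept.sort(key=lambda c: quality_score(c.features), reverse=True)
    return kept[:k]


def _feats(length, verify, explore, granularity):
    return {
        "length": length,
        "self_verification": verify,
        "exploratory_language": explore,
        "adaptive_granularity": granularity,
    }


# Toy data, invented purely for illustration.
demo = [
    Candidate("easy", True, 30, _feats(0.9, 0.9, 0.9, 0.9)),
    Candidate("hard-a", False, 2, _feats(0.9, 0.8, 0.7, 0.6)),
    Candidate("hard-b", False, 1, _feats(0.5, 0.5, 0.5, 0.5)),
    Candidate("unsolved", False, 0, _feats(1.0, 1.0, 1.0, 1.0)),
]
print([c.problem_id for c in curate(demo, k=2)])
```

In the toy run, "easy" is rejected by the first filter and "unsolved" by the second, so only the two hard-but-solvable problems are ranked.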

Key Hyperparameters:

learning_rate: 5.0e-6
batch_size: 64
epochs: 15
warmup: Omitted (0 steps)
distributed training: DeepSpeed ZeRO-3
max_sequence_length: 16384 tokens
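
To make the scale of this recipe concrete, the settings above can be collected into a small config object and used to estimate the number of optimizer steps. This is a sketch under assumptions: LimoSFTSettings and its field names are hypothetical, not from the paper or any library, and the step count assumes that batch_size is the global batch size and that the final partial batch of each epoch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LimoSFTSettings:
    """Hypothetical container for the hyperparameters listed in the summary."""
    base_model: str = "Qwen2.5-32B-Instruct"
    num_examples: int = 800
    learning_rate: float = 5.0e-6
    batch_size: int = 64          # assumed to be the global batch size
    epochs: int = 15
    warmup_steps: int = 0         # warmup omitted
    zero_stage: int = 3           # DeepSpeed ZeRO-3
    max_sequence_length: int = 16384

    def steps_per_epoch(self) -> int:
        # Assumes the last, partially filled batch is not dropped.
        return math.ceil(self.num_examples / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.epochs


cfg = LimoSFTSettings()
print(cfg.steps_per_epoch(), cfg.total_steps())  # prints "13 195"
```

Under these assumptions the whole run is about 195 optimizer steps, which shows how little gradient signal is involved compared with SFT on hundreds of thousands of examples.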

Compute: Not reported in the paper

Comparison to Prior Work

vs. NuminaMath: LIMO uses 1% of the data (800 vs tens of thousands) and focuses on quality/difficulty filtering rather than scale
vs. Standard SFT: LIMO intentionally omits warmup and trains for many epochs (15) on a tiny dataset to enforce 'cognitive templates' without overfitting to specific answers
vs. LIMA [not cited in paper]: LIMA showed 'less is more' for general alignment; LIMO extends this specifically to complex mathematical reasoning which was previously thought to require large-scale data

Limitations

Relies on the presence of rich domain knowledge in the pre-trained foundation model; may not work for domains not covered in pre-training
The curation process is computationally expensive (requires sampling solutions from multiple strong models)
Strictly verified only on mathematical reasoning with verifiable answers

Reproducibility

Code: https://github.com/GAIR-NLP/LIMO

Models, code, and curated datasets are publicly available (GitHub). The specific training recipe (hyperparameters) and data curation pipeline are detailed in the paper.

📊 Experiments & Results

Evaluation Setup

Zero-shot chain-of-thought evaluation

Benchmarks:

AIME24 (High-difficulty Math Competition)
MATH500 (Competitive Math Problems)
OlympiadBench (Math Olympiad)
Gaokao/Kaoyan (Chinese Entrance Exams (OOD)) [New]
Minerva/GPQA (STEM/General Science (OOD))

Metrics:

Pass@1 Accuracy
Statistical methodology: Not explicitly reported in the paper
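
As an illustration of the metric, Pass@1 can be computed as the percentage of problems whose single generated answer matches the reference. The sketch below is a simplification: it uses exact string matching after whitespace stripping, whereas real math evaluations usually check mathematical equivalence, and the function name and toy data are hypothetical.

```python
from typing import Optional, Sequence


def pass_at_1(predictions: Sequence[Optional[str]], references: Sequence[str]) -> float:
    """Percentage of problems whose first (only) answer matches the reference.

    A prediction of None means no final answer could be extracted and
    counts as incorrect.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        pred is not None and pred.strip() == ref.strip()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Toy example: 2 of 4 answers are correct.
preds = ["204", "17", None, " 5 "]
refs = ["204", "18", "3", "5"]
print(pass_at_1(preds, refs))  # prints 50.0
```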

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
LIMO achieves massive improvements over previous fine-tuned baselines on standard in-domain mathematical benchmarks.
AIME24	Pass@1	6.5	63.3	+56.8
MATH500	Pass@1	59.2	95.6	+36.4
LIMO demonstrates superior generalization on Out-Of-Distribution (OOD) tasks, suggesting it learns reasoning patterns rather than memorizing domain data.
Diverse Benchmarks Average	Absolute Improvement	0.0	45.8	+45.8

Main Takeaways

Complex reasoning can be elicited with as few as 800 examples if the pre-trained model has sufficient knowledge.
Data quality (difficulty and reasoning chain detail) is far more critical than data quantity for reasoning tasks.
LIMO exhibits strong OOD generalization, outperforming models trained on much larger datasets, indicating it learns a 'cognitive template' for reasoning rather than memorizing specific problem types.

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT) concepts
Chain-of-Thought (CoT) prompting
Basics of foundation model pre-training vs. post-training

Key Terms

SFT: Supervised Fine-Tuning—retraining a pre-trained model on a smaller, labeled dataset to adapt it for specific tasks

CoT: Chain-of-Thought—a prompting or reasoning style where the model generates intermediate logical steps before producing the final answer

Inference-time computation: The computational work (processing steps/tokens) a model performs while generating an answer, which correlates with reasoning depth

OOD: Out-of-Distribution—test cases that differ significantly from the data seen during training (e.g., different problem types or languages)

AIME: American Invitational Mathematics Examination—a highly challenging high school mathematics competition

Pass@1: An evaluation metric measuring the percentage of problems where the model's first generated answer is correct

Foundation Model: A large-scale model (like Llama or Qwen) pre-trained on vast amounts of data, serving as a base for specific applications

Elicitation: The process of triggering or unlocking capabilities already present in a model's pre-trained weights, rather than teaching new knowledge

Self-Verification: A reasoning step where the model explicitly checks its own intermediate work for errors