Toolformer: Language Models Can Teach Themselves to Use Tools

📝 Paper Summary

Tool-use post-training, Invoking internalized APIs

Toolformer enables language models to teach themselves when and how to use external tools: the model generates candidate API calls, filters them with a self-supervised criterion based on language modeling loss, and fine-tunes on the calls that survive.

Core Problem

Large language models struggle with basic functions like arithmetic, factual lookups, and time awareness, but existing tool-use methods rely on expensive human annotation or task-specific constraints.

Why it matters:

Language models hallucinate facts and lack access to up-to-date information on recent events
Standard scaling does not resolve limitations in mathematical skills or low-resource language understanding
Reliance on human annotation limits the generality and scale of tool adoption in LMs

Concrete Example: When asked 'Who is the publisher of The New England Journal of Medicine?', a standard LM might hallucinate, whereas Toolformer autonomously calls '[QA("Who is the publisher...?")]' to retrieve 'Massachusetts Medical Society' and complete the text correctly.

Key Novelty

Self-Supervised API Bootstrapping

The model uses in-context learning to sample potential API calls within raw text, then executes them to get results
Calls are filtered based on whether providing the API result reduces the model's perplexity (loss) on future tokens compared to not having the result
The model fine-tunes on its own useful, filtered predictions, learning to invoke tools implicitly without human supervision

Architecture

The self-supervised annotation process for creating the Toolformer dataset.

Evaluation Highlights

Outperforms the much larger GPT-3 (175B) on the T-REx subset of LAMA (factual probing) by 13.7 points (53.5 vs 39.8) using a 6.7B model
Achieves 40.4% accuracy on ASDiv math benchmark, nearly 3x the performance of GPT-3 (14.0%) in zero-shot settings
Raises temporal fact performance on DATESET from 0.8 (GPT-3) to 27.3 by leveraging a calendar tool

Breakthrough Assessment

9/10

Pioneered a scalable, self-supervised method for generalized tool use. It demonstrated that smaller models with tools can beat massive models, shifting the paradigm from 'bigger is better' to 'smarter tool use'.

⚙️ Technical Details

Problem Definition

Setting: Self-supervised language modeling augmented with potential API calls

Inputs: Plain text dataset C

Outputs: Augmented dataset C* containing interleaved API calls and results

Pipeline Flow

Sample API Calls (M samples potential calls via prompting)
Execute API Calls (External tools return results)
Filter API Calls (Keep calls that reduce loss on future tokens)
Fine-tune Model (Train M on the augmented dataset)
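
The first three pipeline steps can be sketched as a small loop. This is a minimal illustration rather than the authors' implementation: the function names, the toy sampler, executor, and usefulness check, and the inline `[Call(args) -> result]` rendering (modeled on the QA example earlier in this summary) are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple


def linearize(call: str, result: Optional[str] = None) -> str:
    """Render an API call as inline text, e.g. [Calculator(2+3) -> 5]."""
    return f"[{call}]" if result is None else f"[{call} -> {result}]"


def build_augmented_dataset(
    corpus: List[str],
    sample_calls: Callable[[str], List[Tuple[int, str]]],
    execute: Callable[[str], str],
    is_useful: Callable[[str, int, str, str], bool],
) -> List[str]:
    """Sample, execute, and filter API calls, then splice the kept ones into the text."""
    augmented = []
    for text in corpus:
        kept = []
        for position, call in sample_calls(text):      # step 1: sample
            result = execute(call)                      # step 2: execute
            if is_useful(text, position, call, result):  # step 3: filter
                kept.append((position, call, result))
        # Insert from right to left so earlier character offsets stay valid.
        for position, call, result in sorted(kept, reverse=True):
            text = text[:position] + linearize(call, result) + " " + text[position:]
        augmented.append(text)
    return augmented


# --- Hypothetical stand-ins for the model and the tools ---

def toy_sample_calls(text: str) -> List[Tuple[int, str]]:
    # Propose one calculator call right after " is ", if present.
    marker = " is "
    idx = text.find(marker)
    return [(idx + len(marker), "Calculator(2+3)")] if idx != -1 else []


def toy_execute(call: str) -> str:
    # Only understands calls of the form "Calculator(a+b)".
    a, b = call[len("Calculator("):-1].split("+")
    return str(int(a) + int(b))


def toy_is_useful(text: str, position: int, call: str, result: str) -> bool:
    # Stand-in for the loss-based filter: keep the call if its result
    # literally appears in the text that follows the insertion point.
    return result in text[position:]


dataset = build_augmented_dataset(
    ["The sum of 2 and 3 is 5."], toy_sample_calls, toy_execute, toy_is_useful
)
print(dataset[0])
```

The fourth step, fine-tuning, would then run ordinary language modeling on the returned strings.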

System Modules

API Sampler

Generate candidate API calls at various positions in text using few-shot prompts

Model or implementation: GPT-J (6.7B)

Tool Executor

Run the sampled calls against external tools to get text results

Model or implementation: Various (Atlas, BM25, Python script, NLLB)

Filtering Module

Compare loss with and without the API result to determine utility

Model or implementation: GPT-J (6.7B)

Inference Generator

Generate text and interrupt decoding to call APIs when the '->' token is predicted

Model or implementation: Toolformer (Fine-tuned GPT-J)
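
The interruption step in the Inference Generator can be illustrated with a toy decoding loop. This is a sketch under simplifying assumptions: `next_token` and `call_api` are hypothetical callables, tokens are plain strings rather than tokenizer IDs, and the `[` / `->` / `]` markers follow the inline format used in this summary.

```python
from typing import Callable, Optional


def generate_with_tools(
    next_token: Callable[[str], Optional[str]],
    call_api: Callable[[str], str],
    prompt: str,
    max_tokens: int = 50,
) -> str:
    """Decode token by token; when '->' is produced, run the pending API call."""
    text = prompt
    for _ in range(max_tokens):
        token = next_token(text)
        if token is None:  # the toy model has nothing more to say
            break
        text += token
        if token == "->":
            start = text.rfind("[")  # start of the most recent call
            if start != -1:
                call = text[start + 1 : -len("->")].strip()
                # Insert the tool result and close the call, then resume decoding.
                text += f" {call_api(call)}]"
    return text


# Hypothetical scripted "model" that ignores its input and replays fixed tokens.
script = iter(["[", "Calculator(2+3)", " ", "->", " which is small."])
scripted_model = lambda text: next(script, None)

# Hypothetical tool that only knows one answer.
toy_calculator = lambda call: {"Calculator(2+3)": "5"}.get(call, "?")

output = generate_with_tools(scripted_model, toy_calculator, "Two plus three is ")
print(output)
```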

Novel Architectural Elements

Loss-based filtering criterion: L_minus - L_plus >= threshold, measuring if an API call effectively reduces 'surprise' on subsequent tokens
Self-supervised annotation loop: The model acts as both the annotator (proposing calls) and the judge (evaluating utility via perplexity) on a general-purpose text corpus (CCNet) rather than on task-specific data
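
The loss-based criterion can be made concrete with a short sketch. It assumes we already have the probabilities the model assigns to the tokens following the call position under three prefixes (no call, call without result, call with result). It uses an unweighted negative log-likelihood for simplicity, and the function names are illustrative.

```python
import math
from typing import Sequence


def nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of the continuation tokens (lower is better)."""
    return -sum(math.log(p) for p in token_probs)


def keep_call(
    p_no_call: Sequence[float],
    p_call_no_result: Sequence[float],
    p_call_with_result: Sequence[float],
    threshold: float = 1.0,
) -> bool:
    """Keep the call if L_minus - L_plus >= threshold."""
    l_plus = nll(p_call_with_result)
    # L_minus is the better (smaller) of the two baselines without a result.
    l_minus = min(nll(p_no_call), nll(p_call_no_result))
    return l_minus - l_plus >= threshold


# The result makes the continuation much more likely: the call is kept.
print(keep_call([0.1, 0.2], [0.1, 0.25], [0.8, 0.9]))
# The result barely helps: the call is discarded.
print(keep_call([0.5, 0.5], [0.5, 0.5], [0.6, 0.6]))
```

Taking the minimum over both baselines means a call is only kept when the result itself helps, not merely the presence of the call text.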

Modeling

Base Model: GPT-J (6.7B)

Training Method: Standard language modeling fine-tuning on augmented dataset

Objective Functions:

Purpose: Select useful API calls.

Formally: Keep a call if min(L(no_call), L(call_no_result)) - L(call_with_result) >= filtering_threshold

Purpose: Fine-tune the model to generate text and tool calls.

Formally: Standard cross-entropy loss on dataset C*

Adaptation: Full fine-tuning

Trainable Parameters: 6.7B

Training Data:

Subset of CCNet augmented with API calls
Heuristic filtering used to select relevant texts for specific tools (e.g., texts with numbers for calculator)

Key Hyperparameters:

batch_size: 128
learning_rate: 1e-5
decoding_top_k: 10 (during inference for API triggering)
filtering_threshold_calculator: 1.0
filtering_threshold_qa: 1.0
warmup_ratio: 0.1
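
One way to read `decoding_top_k` is that an API call is started whenever the call-start token is among the k most likely next tokens, rather than only when it is the single most likely one. The sketch below illustrates that rule; the probability dictionary, the `[` start token, and the function name are assumptions made for the example.

```python
from typing import Dict


def should_start_api_call(
    next_token_probs: Dict[str, float], api_start: str = "[", k: int = 10
) -> bool:
    """Return True if the API start token is among the k most likely next tokens."""
    top_k = sorted(next_token_probs, key=next_token_probs.get, reverse=True)[:k]
    return api_start in top_k


probs = {"the": 0.5, "a": 0.3, "of": 0.15, "[": 0.05}
print(should_start_api_call(probs, k=1))  # greedy-style rule: "[" is not the top token
print(should_start_api_call(probs, k=4))  # relaxed rule: "[" is within the top 4
```

Relaxing the trigger this way makes the model more willing to call tools than plain greedy decoding would.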

Compute: Not reported in the paper

Comparison to Prior Work

vs. TALM: Toolformer learns tool use on the general pre-training corpus (CCNet) rather than downstream tasks, maintaining generality
vs. LaMDA: Toolformer is self-supervised and does not require large-scale human annotation of tool sequences
vs. WebGPT: Toolformer does not require a reward model or human preferences; it relies solely on language modeling perplexity
vs. PAL [not cited in paper]: PAL offloads reasoning entirely to Python programs via few-shot prompting, whereas Toolformer integrates tool outputs back into the generation flow via fine-tuning

Limitations

Cannot use tools in a chain (output of one tool as input to another) because API calls for each tool are sampled independently
Cannot interact with tools (e.g., browsing search results or refining queries)
Sample inefficient; processing more than a million documents yields only a few thousand useful calculator examples
Sensitive to exact wording of inputs when deciding to call an API
Does not account for computational cost of API calls during inference

Reproducibility

Not provided. The paper does not mention a specific public code repository. Prompt templates are provided in Appendix A.2.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on downstream tasks without task-specific fine-tuning or in-context examples

Benchmarks:

LAMA (SQuAD, Google-RE, T-REx) (Factual probing / Cloze completion)
ASDiv (Math word problems)
SVAMP (Math word problems)
MAWPS (Math word problems)
Web Questions (Question Answering)
Natural Questions (Question Answering)
TriviaQA (Question Answering)
MLQA (Multilingual Question Answering)
TEMPLAMA (Temporal factual probing)
DATESET (Date/Time reasoning) [New]

Metrics:

Accuracy (exact, or a relaxed match that checks whether the correct answer appears within the first 5 or first 20 words of the model's output)
Perplexity (for language modeling checks)
Statistical methodology: Not explicitly reported in the paper
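
A relaxed-match metric of this kind can be sketched as follows. It takes one plausible reading, namely that a prediction counts as correct if the answer appears as a contiguous word sequence within the first N words of the output. The normalization (lowercasing, stripping punctuation) and the function names are assumptions, not details taken from the paper.

```python
import string
from typing import List


def _words(s: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [w.strip(string.punctuation).lower() for w in s.split()]


def relaxed_match(prediction: str, answer: str, first_n_words: int = 5) -> bool:
    """True if the answer occurs as a word sequence in the first N predicted words."""
    window = _words(prediction)[:first_n_words]
    target = _words(answer)
    n = len(target)
    return n > 0 and any(
        window[i : i + n] == target for i in range(len(window) - n + 1)
    )


answer = "Massachusetts Medical Society"
early = "The Massachusetts Medical Society publishes it."
late = "It is published in Boston by the Massachusetts Medical Society"

print(relaxed_match(early, answer, first_n_words=5))
print(relaxed_match(late, answer, first_n_words=5))
print(relaxed_match(late, answer, first_n_words=20))
```

A wider window is more forgiving of models that produce a full sentence before stating the answer.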

Key Results

Benchmark	Metric	Baseline (GPT-J)	This Paper (Toolformer)	Δ
LAMA results show Toolformer significantly outperforming baselines on factual knowledge by leveraging the QA tool.
LAMA (SQuAD)	Top-5 Accuracy	17.8	33.8	+16.0
LAMA (T-REx)	Top-5 Accuracy	31.9	53.5	+21.6
Math benchmarks demonstrate massive gains from the Calculator tool, surpassing much larger models.
ASDiv	Accuracy	7.5	40.4	+32.9
SVAMP	Accuracy	5.2	29.4	+24.2
Temporal reasoning results show the Calendar tool enables solving tasks that are impossible for static models.
DATESET	Top-5 Accuracy	3.9	27.3	+23.4
Language modeling checks show that adding tool capabilities costs little in core text generation (perplexity, lower is better).
WikiText	Perplexity	9.9	10.3	+0.4

Experiment Figures

Performance scaling of tool use across GPT-2 model sizes (124M to 1.6B) and GPT-J.

Main Takeaways

Toolformer significantly improves zero-shot performance across factual, mathematical, and temporal tasks by autonomously deciding to use tools.
The method scales performance beyond model size: a 6.7B model often beats 66B and 175B baselines when equipped with tools.
Tool capability emerges with size; applying the same method to smaller GPT-2 models shows that tool use only becomes effective around 775M parameters.
Fine-tuning on the tool-augmented dataset does not degrade the model's core language modeling capabilities (perplexity remains stable).

📚 Prerequisite Knowledge

Prerequisites

In-context learning (few-shot prompting)
Language model fine-tuning
Perplexity / Cross-entropy loss
Retrieval-Augmented Generation (RAG)

Key Terms

perplexity: A measurement of how well a probability model predicts a sample; lower perplexity indicates the model is less 'surprised' by the text
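
As a small worked example, perplexity can be computed as the exponential of the average negative log-probability per token. The probabilities below are made up for illustration.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability assigned to each token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that gives every token probability 1/4 is as "surprised" as a
# uniform guess among 4 options.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95])
unsure = perplexity([0.1, 0.2, 0.05])
print(uniform, confident, unsure)
```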

in-context learning: Providing a language model with a few examples of a task within the prompt to guide its generation without updating weights

API: Application Programming Interface—a mechanism for the model to interact with external tools like calculators or search engines

BM25: A ranking function used by search engines to estimate the relevance of documents to a given search query

greedy decoding: A text generation strategy where the model always selects the highest-probability next token

CCNet: A large dataset of web text extracted and quality-filtered from Common Crawl, used for training language models

LAMA: LAnguage Model Analysis—a benchmark for testing the factual and commonsense knowledge contained in language models

zero-shot: Evaluating a model on a task without providing any examples of that specific task in the prompt