Evaluation Setup
Zero-shot evaluation on downstream tasks without task-specific fine-tuning or in-context examples
Benchmarks:
- LAMA (SQuAD, Google-RE, T-REx) (Factual probing / Cloze completion)
- ASDiv (Math word problems)
- SVAMP (Math word problems)
- MAWPS (Math word problems)
- Web Questions (Question Answering)
- Natural Questions (Question Answering)
- TriviaQA (Question Answering)
- MLQA (Multilingual Question Answering)
- TEMPLAMA (Temporal factual probing)
- DATESET (Date/Time reasoning; new dataset introduced in this paper)
Metrics:
- Accuracy (for some benchmarks a relaxed match: the correct answer must appear within the model's first 5 or first 20 predicted words)
- Perplexity (for language modeling checks)
- Statistical methodology: Not explicitly reported in the paper
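The relaxed-match accuracy listed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names `normalize`, `relaxed_match`, and `accuracy` are hypothetical, and the lowercase alphanumeric word splitting is an assumed normalization.

```python
from __future__ import annotations

import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words (assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relaxed_match(prediction: str, answers: list[str], k: int) -> bool:
    """Return True if any gold answer occurs within the first k predicted words."""
    # Pad with spaces so that only whole-word (or whole-phrase) matches count.
    window = " " + " ".join(normalize(prediction)[:k]) + " "
    for answer in answers:
        target = " ".join(normalize(answer))
        if target and f" {target} " in window:
            return True
    return False


def accuracy(predictions: list[str], gold: list[list[str]], k: int) -> float:
    """Percentage of predictions that pass the relaxed match with window size k."""
    hits = sum(relaxed_match(p, g, k) for p, g in zip(predictions, gold))
    return 100.0 * hits / len(predictions)
```

With a window of 5, the output "The capital of France is Paris, of course." does not match the answer "Paris" because it is the sixth word; with a window of 20 it does. The window size therefore changes the score noticeably and has to be reported alongside the accuracy.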
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LAMA results show Toolformer significantly outperforming baselines on factual knowledge by leveraging the QA tool. | | | | |
| LAMA (SQuAD) | Top-5 Accuracy | 17.8 | 33.8 | +16.0 |
| LAMA (T-REx) | Top-5 Accuracy | 31.9 | 53.5 | +21.6 |
| Math benchmarks demonstrate massive gains from the Calculator tool, surpassing much larger models. | | | | |
| ASDiv | Accuracy | 7.5 | 40.4 | +32.9 |
| SVAMP | Accuracy | 5.2 | 29.4 | +24.2 |
| Temporal reasoning results show the Calendar tool enables solving tasks that are impossible for static models. | | | | |
| DATESET | Top-5 Accuracy | 3.9 | 27.3 | +23.4 |
| Language modeling checks confirm that adding tool capabilities does not degrade core text generation. | | | | |
| WikiText | Perplexity | 9.9 | 10.3 | +0.4 |
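The Δ column can be recomputed from the Baseline and This Paper columns as a sanity check. The sketch below copies the table values verbatim; the names `RESULTS`, `delta`, and `is_improvement` are hypothetical helpers. Note that perplexity is a lower-is-better metric, so the +0.4 on WikiText is a small increase rather than a gain.

```python
# Rows copied from the results table: (benchmark, metric, baseline, this_paper).
RESULTS = [
    ("LAMA (SQuAD)", "Top-5 Accuracy", 17.8, 33.8),
    ("LAMA (T-REx)", "Top-5 Accuracy", 31.9, 53.5),
    ("ASDiv", "Accuracy", 7.5, 40.4),
    ("SVAMP", "Accuracy", 5.2, 29.4),
    ("DATESET", "Top-5 Accuracy", 3.9, 27.3),
    ("WikiText", "Perplexity", 9.9, 10.3),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute difference (this paper minus baseline), rounded to one decimal."""
    return round(ours - baseline, 1)


def is_improvement(metric: str, baseline: float, ours: float) -> bool:
    """Accuracy metrics improve when they rise; perplexity improves when it falls."""
    lower_is_better = metric == "Perplexity"
    return ours < baseline if lower_is_better else ours > baseline


# Benchmark name -> recomputed delta.
DELTAS = {name: delta(b, o) for name, _, b, o in RESULTS}

if __name__ == "__main__":
    for name, metric, b, o in RESULTS:
        flag = is_improvement(metric, b, o)
        print(f"{name:14s} {metric:15s} {delta(b, o):+5.1f}  improved={flag}")
```

All six recomputed values agree with the Δ column as reported.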
Main Takeaways
- Toolformer significantly improves zero-shot performance across factual, mathematical, and temporal tasks by autonomously deciding to use tools.
- Tool use can substitute for scale: the 6.7B-parameter model equipped with tools often beats the much larger 66B and 175B baselines.
- Tool capability emerges with size; applying the same method to smaller GPT-2 models shows that tool use only becomes effective around 775M parameters.
- Fine-tuning on the tool-augmented dataset leaves core language modeling largely intact: WikiText perplexity moves only from 9.9 to 10.3.
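For readers unfamiliar with the perplexity check, the sketch below shows the standard definition (the exponential of the negative mean log-probability per token) and the relative size of the WikiText change. It is an illustration under stated assumptions: the helper names are hypothetical, natural-log token probabilities are assumed as input, and the exact normalization used in the paper is not specified in this summary.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_change_percent(baseline: float, ours: float) -> float:
    """Relative change of `ours` with respect to `baseline`, in percent."""
    return round(100.0 * (ours - baseline) / baseline, 1)
```

A model that assigns probability 0.1 to every token has a perplexity of 10, which is the same order as the WikiText values reported here. The move from 9.9 to 10.3 corresponds to a relative increase of roughly 4%.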