Evaluation Setup
ToolkenGPT is evaluated on three task families: numerical reasoning, knowledge-based QA, and embodied plan generation.
Benchmarks:
- GSM8K-XL (Numerical reasoning with large numbers) [New]
- FuncQA (Complex numerical reasoning with 13 tools) [New]
- KAMEL (Knowledge-based QA over Wikidata, 234 relations)
- VirtualHome (Embodied plan generation)
Metrics:
- Accuracy (Exact Match)
- Success Rate (VirtualHome)
- Grounding Rate (VirtualHome)
- Statistical methodology: Not explicitly reported in the paper
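
The accuracy and success-rate metrics listed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation script: the function names and the string normalization (lowercasing, whitespace collapsing) are assumptions, and the source does not say how answers are compared beyond "exact match". Grounding rate is left out because the source does not define it.

```python
def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace (an assumed normalization)."""
    return " ".join(answer.strip().lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def success_rate(outcomes):
    """Fraction of episodes flagged as successful (truthy)."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(bool(o) for o in outcomes) / len(outcomes)
```

For example, `exact_match_accuracy(["1", "2", "3", "4"], ["1", "2", "3", "5"])` gives 0.75, and `success_rate([True, False, True, True])` gives 0.75.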
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Numerical reasoning results demonstrating capability with basic (4) and extended (13) toolsets. | | | | |
| GSM8K-XL (4 tools) | Accuracy | 0.32 | 0.33 | +0.01 |
| FuncQA One-Hop (13 tools) | Accuracy | 0.57 | 0.73 | +0.16 |
| FuncQA Multi-Hops (13 tools) | Accuracy | 0.06 | 0.15 | +0.09 |
| Knowledge-based QA (KAMEL) results showing scaling with number of tools. | | | | |
| KAMEL (30 tools) | Accuracy | 0.48 | 0.95 | +0.47 |
| KAMEL (234 tools) | Accuracy | 0.20 | 0.50 | +0.30 |
| Embodied agent planning results on VirtualHome. | | | | |
| VirtualHome | Success Rate | 0.38 | 0.68 | +0.30 |
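
The Δ column is the absolute difference between the two score columns. The sketch below recomputes it from the reported numbers and also prints a relative gain, which the source does not report; the helper names are illustrative.

```python
# (baseline, this paper) pairs copied from the results table.
results = {
    "GSM8K-XL (4 tools)": (0.32, 0.33),
    "FuncQA One-Hop (13 tools)": (0.57, 0.73),
    "FuncQA Multi-Hops (13 tools)": (0.06, 0.15),
    "KAMEL (30 tools)": (0.48, 0.95),
    "KAMEL (234 tools)": (0.20, 0.50),
    "VirtualHome": (0.38, 0.68),
}


def absolute_gain(baseline: float, ours: float) -> float:
    """Absolute improvement, rounded to the table's two decimal places."""
    return round(ours - baseline, 2)


def relative_gain(baseline: float, ours: float) -> float:
    """Improvement as a fraction of the baseline score."""
    return (ours - baseline) / baseline


for name, (baseline, ours) in results.items():
    print(f"{name}: delta={absolute_gain(baseline, ours):+.2f}, "
          f"relative={relative_gain(baseline, ours):+.0%}")
```

Relative gain puts the small absolute numbers in context: the FuncQA multi-hop score of 0.15 is 2.5 times its 0.06 baseline, even though the absolute gain is only +0.09.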
Main Takeaways
- Scalability: ToolkenGPT stays ahead of the baseline as the number of tools grows (KAMEL accuracy 0.95 vs 0.48 at 30 tools, 0.50 vs 0.20 at 234 tools), whereas in-context learning degrades rapidly due to context limits.
- Efficiency: Training toolken embeddings is computationally cheap (about 2 minutes versus 40 minutes for LoRA) and requires minimal GPU memory.
- Flexibility: Toolken embeddings can be learned effectively from either supervised in-domain data or synthetic data generated by LLMs.
- Generalization: Embeddings learned on simple (one-hop) tasks improve performance on complex (multi-hop) tasks, suggesting robust representation learning.
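
To give a rough sense of why training is cheap, the sketch below counts trainable parameters under two assumptions: each tool is represented by a single learned embedding vector whose length equals the language model's hidden size, and nothing else in the model is updated. The hidden size of 4096 is a hypothetical value chosen for illustration, not a figure from the source; the tool counts are the ones used in the benchmarks above.

```python
def toolken_trainable_params(num_tools: int, hidden_size: int) -> int:
    """One learned embedding vector of length hidden_size per tool (assumed)."""
    return num_tools * hidden_size


HIDDEN_SIZE = 4096  # hypothetical hidden size, not taken from the source

for num_tools in (4, 13, 30, 234):
    count = toolken_trainable_params(num_tools, HIDDEN_SIZE)
    print(f"{num_tools:>3} tools -> {count:,} trainable parameters")
```

Under these assumptions, even the 234-tool KAMEL setting trains fewer than one million parameters (234 × 4096 = 958,464), and the count grows linearly with the number of tools.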