Evaluation Setup
Tool manipulation using API calls, evaluated via execution success rate
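As a rough illustration of execution-based evaluation, the sketch below runs each predicted API call and compares its result with the result of the reference call. The `evaluate` function, the `toy_execute` stand-in tool, and the exact definitions used here (executability as the share of calls that run without error, success as the share whose result equals the reference result) are assumptions for illustration, not the paper's evaluation harness.

```python
def toy_execute(call):
    """Hypothetical stand-in for a real tool: 'add 2 3' -> 5, 'mul 2 3' -> 6."""
    name, *args = call.split()
    ops = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}
    # Unknown names raise KeyError; a wrong argument count raises TypeError.
    return ops[name](*(int(a) for a in args))


def evaluate(predictions, references, execute):
    """Return (executability, success_rate) as percentages.

    Assumed definitions: a prediction is executable if `execute` does not
    raise, and successful if its result equals the reference call's result.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    total = len(predictions)
    if total == 0:
        return 0.0, 0.0
    ran = 0
    correct = 0
    for pred, ref in zip(predictions, references):
        expected = execute(ref)  # reference calls are assumed to be valid
        try:
            actual = execute(pred)
        except Exception:
            continue  # not executable, so it cannot count as a success
        ran += 1
        if actual == expected:
            correct += 1
    return 100.0 * ran / total, 100.0 * correct / total
```

Comparing execution results rather than call strings means a prediction such as `add 3 2` still counts as correct against the reference `add 2 3`, which is the main reason to prefer an execution-based metric over string matching.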
Benchmarks:
- ToolBench (tool manipulation via API call generation) [New]
Metrics:
- Success Rate (Execution-based)
- Reward (for WebShop)
- Executability
- Longest Common Subsequence (LCS)
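The LCS metric in the list above can be illustrated with a short sketch. Whitespace tokenization and normalization by the reference length are assumptions made here for illustration; the source does not say how the score is tokenized or normalized.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    # prev[j] is the LCS length of the already-processed prefix of `a` and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(predicted, reference):
    """Assumed normalization: LCS length divided by the reference token count."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not ref_tokens:
        return 0.0
    return lcs_length(pred_tokens, ref_tokens) / len(ref_tokens)
```

Unlike an execution-based metric, a score of this kind only measures surface overlap with a reference action sequence, so it is most useful when the generated actions cannot easily be executed and checked.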
Key Results
Enhanced open-source models (tuned + retriever + system prompt) show massive gains over zero-shot baselines and become competitive with GPT-4 on simpler tasks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Home Search | Success Rate | 0.0 | 87.0 | +87.0 |
| Open Weather | Success Rate | 39.0 | 100.0 | +61.0 |
| Trip Booking | Success Rate | 0.0 | 85.8 | +85.8 |
| WebShop | Reward | 22.0 | 31.0 | +9.0 |
| Google Sheets | Success Rate | 5.9 | 21.2 | +15.3 |

Ablation studies reveal that model alignment (fine-tuning) is the most critical component for performance.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across tasks (Task Count) | Tasks Improved | 0 | -5 | -5 |

Main Takeaways
- Open-source LLMs can be boosted to match GPT-4 on specific tool use tasks using a combination of synthetic data alignment, retrieval, and system prompts.
- Model alignment (fine-tuning) is the most impactful factor, addressing API selection and argument population failures.
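- In-context demonstration retrieval is essential for generalizing to unseen API combinations, and it requires only O(n) demonstration examples, i.e. a number that grows linearly with the number n of API functions rather than with the number of possible combinations.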
- A significant gap remains on complex reasoning tasks (Google Sheets, Tabletop) where open-source models still lag behind GPT-4 even with enhancements.
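
The demonstration-retrieval idea in the takeaways above can be sketched as follows. This is a minimal illustration under stated assumptions: the Jaccard token-overlap scoring is a simple stand-in rather than the paper's retriever, and the `DEMOS` entries, the `API.*` call names, and the prompt layout are hypothetical.

```python
# Hypothetical demonstration repository: (goal, API call) pairs, with each
# API function appearing at least once so the repository stays linear in size.
DEMOS = [
    ("find homes in Austin", "API.set_location('Austin')"),
    ("set minimum price to 500", "API.set_min_price(500)"),
    ("get weather for Paris", "API.get_weather('Paris')"),
]


def tokenize(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def retrieve(query, demos, k=2):
    """Return the k demonstrations whose goals overlap most with the query."""
    q = tokenize(query)

    def score(demo):
        d = tokenize(demo[0])
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    return sorted(demos, key=score, reverse=True)[:k]


def build_prompt(query, demos, k=2):
    """Place the retrieved demonstrations before the new task in the prompt."""
    parts = [f"Task: {goal}\nAction: {call}" for goal, call in retrieve(query, demos, k)]
    parts.append(f"Task: {query}\nAction:")
    return "\n\n".join(parts)
```

For a query such as `"find homes in Seattle with minimum price 300"`, this retrieves the location and price demonstrations together, even though no single stored example combines both APIs, which is the sense in which per-API examples can cover unseen combinations.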