| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Evaluation on unseen simulated tools showing ToolAlpaca matching GPT-3.5 performance. | ||||
| Simulated Subset | Overall (GPT-4 eval) | 16.0 | 70.0 | +54.0 |
| Simulated Subset | Overall (Human eval) | 25.0 | 75.0 | +50.0 |
| Evaluation on real-world APIs demonstrating generalization from simulated training data. | ||||
| Real-world Subset | Overall (Human eval) | 12.3 | 61.4 | +49.1 |
| Real-world Subset | Overall (Human eval) | 72.8 | 61.4 | -11.4 |
| Generalization to out-of-domain multi-modal tools (GPT4Tools benchmark). | ||||
| GPT4Tools Test Set | Success Rate (SR) | 26.2 | 83.7 | +57.5 |