| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance on unseen simulated tools demonstrates the effectiveness of the synthetic training corpus. | ||||
| Simulated Tools Subset | Overall (Machine Eval) | 16.0 | 70.0 | +54.0 |
| Simulated Tools Subset | Human Accept Rate | 25.0 | 75.0 | +50.0 |
| Performance on real-world APIs shows generalization from simulated training data to authentic scenarios. | ||||
| Real-world APIs Subset | Overall (Human Eval) | 12.3 | 61.4 | +49.1 |
| Real-world APIs Subset | Overall (Human Eval) | 7.9 | 55.3 | +47.4 |
| Out-of-distribution evaluation on multi-modal tools indicates broad generalization: the success rate remains 83.7, 6.9 points below the baseline. | ||||
| GPT4Tools Test Set | Success Rate (SR) | 90.6 | 83.7 | -6.9 |