Evaluation Setup
Tool-use evaluation on the ToolBench dataset, involving instruction following and tool execution.
Benchmarks:
- ToolBench (Tool Learning / API Execution)
Metrics:
- Success Rate (SR)
- Statistical methodology: Not explicitly reported in the paper
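As a minimal sketch of the metric, assuming Success Rate is the percentage of test instructions the model completes successfully (the paper's exact success criterion is not given here). The function name `success_rate` and the sample outcomes are hypothetical and not taken from the paper:

```python
def success_rate(outcomes):
    """Return the percentage of successful episodes, rounded to one decimal.

    `outcomes` is a list of booleans, one per test instruction.
    This is an assumed definition of SR, not the paper's official scorer.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    successes = sum(1 for ok in outcomes if ok)
    return round(100.0 * successes / len(outcomes), 1)


# Hypothetical outcomes for five test instructions (illustrative only).
example_outcomes = [True, False, True, True, False]
print(success_rate(example_outcomes))
```

Under this definition, scores such as 60.0 are percentages, so differences between systems are measured in percentage points.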
Key Results
Performance comparison on Unseen Instructions (Test Set I1) showing generalization to new queries for known tools.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| ToolBench (I1-Inst.) | Success Rate | 57.2 | 60.0 | +2.8 |
| ToolBench (I1-Inst.) | Success Rate | 58.0 | 60.0 | +2.0 |

Performance comparison on Unseen Tools (Test Set I2 Category) showing generalization to completely new tool categories.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| ToolBench (I2-Cat.) | Success Rate | 51.1 | 60.3 | +9.2 |
| ToolBench (I2-Cat.) | Success Rate | 46.8 | 60.3 | +13.5 |

Performance on Unseen Tools (Test Set I3 Tool) showing generalization to new tools within known categories.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| ToolBench (I3-Tool) | Success Rate | 55.6 | 65.3 | +9.7 |
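The Δ values in the results above are plain differences between the two Success Rate scores. A short sketch that recomputes them from the reported numbers, assuming Δ means "This Paper" minus "Baseline" in percentage points; the variable and function names are illustrative:

```python
# (benchmark, baseline SR, this paper SR), copied from the results above.
rows = [
    ("ToolBench (I1-Inst.)", 57.2, 60.0),
    ("ToolBench (I1-Inst.)", 58.0, 60.0),
    ("ToolBench (I2-Cat.)", 51.1, 60.3),
    ("ToolBench (I2-Cat.)", 46.8, 60.3),
    ("ToolBench (I3-Tool)", 55.6, 65.3),
]


def delta(baseline, ours):
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(ours - baseline, 1)


deltas = [delta(baseline, ours) for _, baseline, ours in rows]

for (name, baseline, ours), d in zip(rows, deltas):
    print(f"{name}: {baseline} -> {ours} ({d:+.1f})")
```

The recomputed gains match the reported column. The smallest gains fall on the unseen-instruction split and the largest on the unseen-category split.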
Main Takeaways
- Multi-stage curriculum effectively bridges the gap between simple execution and complex selection, improving performance on unseen tools.
- Iterative feedback (ISIF) prevents the model from overfitting to easy tools by forcing it to practice intricate ones.
- The approach generalizes to unseen instructions and categories, with Success Rate gains of +2.0 to +13.5 points over the baselines on ToolBench, and is reported to outperform proprietary models like ChatGPT on specific tool-use benchmarks.