Evaluation Setup
Tool-use capability is evaluated on the held-out test set of ToolGrad-5K, with out-of-distribution (OOD) evaluation on ToolBench.
Benchmarks:
- ToolGrad-5K Test Set (Tool-use query execution) [New]
- ToolBench (Out-of-distribution tool use)
Metrics:
- Pass rate (Data Generation)
- Number of ground-truth tool uses (Data Complexity)
- Tool Recall
- Success Rate (of tool calls)
- Quality of Response (QoR - LLM judge)
- Statistical methodology: a paired t-test is used when comparing base and reasoning models.
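The summary names Tool Recall and Success Rate without defining them. The sketch below shows one plausible reading: recall as the share of ground-truth tools the model invoked at least once, and success rate as the share of issued calls that returned without error. Both definitions, the `ToolCall` record, and the tool names are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation made by the model (hypothetical record type)."""
    name: str
    succeeded: bool  # True if the call returned without an error


def tool_recall(ground_truth: set[str], calls: list[ToolCall]) -> float:
    """Fraction of ground-truth tools invoked at least once (assumed definition)."""
    if not ground_truth:
        return 1.0  # nothing to recall; treated as perfect by convention here
    used = {c.name for c in calls}
    return len(ground_truth & used) / len(ground_truth)


def success_rate(calls: list[ToolCall]) -> float:
    """Fraction of issued tool calls that succeeded (assumed definition)."""
    if not calls:
        return 0.0
    return sum(c.succeeded for c in calls) / len(calls)


# Made-up example: three calls, one failure, one ground-truth tool never used.
calls = [
    ToolCall("get_weather", True),
    ToolCall("convert_units", False),
    ToolCall("get_weather", True),
]
truth = {"get_weather", "convert_units", "send_email"}

recall = tool_recall(truth, calls)  # 2 of 3 ground-truth tools were called
success = success_rate(calls)       # 2 of 3 calls succeeded
```

Under this reading the two metrics are independent: a model can call every required tool (high recall) while many of those calls fail (low success rate), which is presumably why both are reported.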
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Data Generation Efficiency: ToolGrad is compared against the DFS baseline from ToolBench regarding generation metrics. | | | | |
| Generation Statistics | Pass rate | 63.8 | 100.0 | +36.2 |
| Generation Statistics | Avg. Ground-truth Tool Uses | 3.3 | 6.1 | +2.8 |
| Generation Statistics | LLM Calls per Sample | 64.5 | 45.9 | -18.6 |
| Model Performance: Fine-tuned ToolGrad models are compared against proprietary baselines on the ToolGrad-5K test set. | | | | |
| ToolGrad-5K Test Set | Tool Recall | 85.2 | 99.2 | +14.0 |
| Out-of-Distribution (OOD) Performance: Models trained on ToolGrad-5K evaluated on ToolBench test set vs models trained on ToolBench. | | | | |
| ToolBench (OOD) | Win Rate | 50.0 | 55.8 | +5.8 |
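The Δ column is an absolute difference (This Paper minus Baseline), so a negative value for LLM Calls per Sample is an improvement. As a quick consistency check, the sketch below recomputes each Δ from the table values and adds a relative change; the relative-change column is not reported in the source and is shown only as an illustration.

```python
# (metric, baseline, this_paper) copied from the Key Results table.
rows = [
    ("Pass rate", 63.8, 100.0),
    ("Avg. Ground-truth Tool Uses", 3.3, 6.1),
    ("LLM Calls per Sample", 64.5, 45.9),
    ("Tool Recall", 85.2, 99.2),
    ("Win Rate", 50.0, 55.8),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute difference, rounded to one decimal like the table."""
    return round(ours - baseline, 1)


def relative_change(baseline: float, ours: float) -> float:
    """Change as a fraction of the baseline (not reported in the source)."""
    return (ours - baseline) / baseline


deltas = {metric: delta(b, o) for metric, b, o in rows}
relative = {metric: relative_change(b, o) for metric, b, o in rows}

for metric, b, o in rows:
    print(f"{metric}: delta={deltas[metric]:+.1f}, relative={relative[metric]:+.1%}")
```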
Main Takeaways
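- The ToolGrad framework reduces data generation cost (fewer LLM and tool calls; 45.9 vs. 64.5 LLM calls per sample against the DFS baseline) while increasing chain complexity (6.1 vs. 3.3 ground-truth tool uses on average) and achieving a 100% pass rate (vs. 63.8).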
- Small models (1B parameters) fine-tuned on ToolGrad-5K outperform much larger proprietary models (GPT-4, Claude 3.7) on in-distribution tool-use tasks.
- Models trained on ToolGrad-5K generalize out of distribution, outperforming models trained on the original ToolBench dataset when evaluated on ToolBench (win rate 55.8 vs. 50.0).
- Reasoning models (e.g., o1-mini equivalents) were found to underperform their base counterparts on tool-use tasks in this setup.
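The base-versus-reasoning comparison in the last takeaway relies on the paired t-test named in the evaluation setup. A minimal sketch of that statistic follows, assuming each model is scored on the same set of queries so that scores can be matched pairwise. The function name and the scores are placeholders invented for illustration; they are not results from the paper.

```python
import math
import statistics


def paired_t_statistic(base: list[float], reasoning: list[float]) -> tuple[float, int]:
    """Return (t, degrees of freedom) for a paired t-test on matched scores."""
    if len(base) != len(reasoning) or len(base) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on per-query differences: reasoning score minus base score.
    diffs = [r - b for b, r in zip(base, reasoning)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Placeholder per-query scores, NOT results from the paper.
base_scores = [0.9, 0.8, 0.7, 0.6]
reasoning_scores = [0.8, 0.6, 0.6, 0.4]

t_stat, dof = paired_t_statistic(base_scores, reasoning_scores)
print(f"t = {t_stat:.3f}, dof = {dof}")
```

A negative t here means the reasoning model scored lower on average than its base counterpart. Turning t into a p-value requires the Student's t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` computes both in one call.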