Evaluation Setup
Tool-use evaluation across single-turn, multi-turn, and agentic benchmarks
Benchmarks:
- BFCL-v4: function calling (single-turn, multi-turn, agentic)
- Tau-bench: complex user-agent dialogue (Retail and Airline domains)
- Tau-2-bench: agentic dialogue in which the user also has tool access
Metrics:
- Accuracy (BFCL-v4)
- Pass Rate (Tau-bench)
Key Results
Performance on BFCL-v4 shows Qwen3 models trained on ToolMind outperforming significantly larger baselines, particularly in multi-turn scenarios.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| BFCL-v4 (Multi-Turn) | Accuracy | 72.82 | 79.24 | +6.42 |
| BFCL-v4 (Overall) | Accuracy | 83.65 | 86.13 | +2.48 |

Results on Tau-bench (Retail) demonstrate strong improvements in handling complex domain-specific policies.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Tau-bench (Retail) | Pass Rate | 57.8 | 71.4 | +13.6 |

Ablation studies confirm the necessity of both synthetic data and rigorous quality filtering.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Tau-bench (Avg) | Pass Rate | 46.1 | 52.8 | +6.7 |
| BFCL-v4 (Overall) | Accuracy | 81.67 | 84.34 | +2.67 |
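The Δ column is the absolute difference between the "This Paper" and "Baseline" scores. As a quick sanity check, the sketch below recomputes it from the reported values and also derives a relative gain, which the summary does not report. The function names and the "ablation" row labels are illustrative choices, not taken from the paper.

```python
# Values copied from the Key Results rows above:
# (benchmark, baseline, this_paper, reported_delta)
ROWS = [
    ("BFCL-v4 (Multi-Turn)", 72.82, 79.24, 6.42),
    ("BFCL-v4 (Overall)", 83.65, 86.13, 2.48),
    ("Tau-bench (Retail)", 57.8, 71.4, 13.6),
    ("Tau-bench (Avg), ablation", 46.1, 52.8, 6.7),
    ("BFCL-v4 (Overall), ablation", 81.67, 84.34, 2.67),
]


def absolute_gain(baseline: float, ours: float) -> float:
    """Absolute improvement in score points, rounded to two decimals."""
    return round(ours - baseline, 2)


def relative_gain(baseline: float, ours: float) -> float:
    """Improvement as a percentage of the baseline (not reported in the source)."""
    return round(100.0 * (ours - baseline) / baseline, 1)


for name, baseline, ours, reported in ROWS:
    gain = absolute_gain(baseline, ours)
    status = "matches" if gain == reported else "differs from"
    print(f"{name}: +{gain} ({status} reported), "
          f"relative +{relative_gain(baseline, ours)}%")
```

Relative gains put the rows on a common footing: the same absolute gain counts for more against a lower baseline.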
Main Takeaways
- ToolMind improves multi-turn and agentic capabilities, with reported gains of +6.42 on BFCL-v4 Multi-Turn and +13.6 on Tau-bench Retail over the listed baselines.
- Turn-level quality filtering is critical; removing it leads to noticeable performance drops, indicating that trajectories with a correct final outcome can still contain harmful noise in individual turns.
- Combining synthesized graph-based data with augmented open-source data yields the best overall performance, suggesting complementary benefits.
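
To make the turn-level filtering idea concrete, here is a minimal sketch of one possible reading: each turn carries a quality score, and low-scoring assistant turns are excluded from the training loss while remaining in the context. This is an assumption for illustration only, not the paper's actual pipeline. The `Turn` class, the `quality` field, the 0.5 threshold, and the `get_order` tool call are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str       # "user", "assistant", or "tool"
    content: str
    quality: float  # hypothetical per-turn judge score in [0, 1]


def training_mask(trajectory: List[Turn], threshold: float = 0.5) -> List[bool]:
    """Return True for turns that should contribute to the training loss.

    Only assistant turns at or above the threshold are kept as targets;
    user and tool turns serve as context and are never trained on.
    """
    return [t.role == "assistant" and t.quality >= threshold for t in trajectory]


# A trajectory that ends correctly but contains one noisy intermediate step.
trajectory = [
    Turn("user", "Cancel order 123", 1.0),
    Turn("assistant", "call get_order(id=999)", 0.2),  # wrong id: noisy step
    Turn("tool", "order not found", 1.0),
    Turn("assistant", "call get_order(id=123)", 0.9),
]

mask = training_mask(trajectory)
print(mask)
```

A trajectory-level filter would accept this whole example because it ends correctly. The per-turn mask instead keeps only the final assistant turn as a training target.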