| Benchmark | Metric | Baseline | This Paper | Δ (This Paper - Baseline) |
|---|---|---|---|---|
| Tool-N1 models consistently outperform proprietary and open-source baselines on the primary BFCL benchmark. | ||||
| BFCL (Overall) | Accuracy | 83.97 | 84.82 | +0.85 |
| BFCL (Overall) | Accuracy | 83.97 | 85.97 | +2.00 |
| BFCL (Overall) | Accuracy | 81.88 | 84.82 | +2.94 |
| Tool-N1 demonstrates strong generalization on additional benchmarks API-Bank and ACEBench. | ||||
| API-Bank | Accuracy | 77.16 | 82.19 | +5.03 |
| ACEBench | Accuracy | 82.34 | 87.00 | +4.66 |
| Ablation study on training recipes reveals pure RL outperforms pipelines involving SFT. | ||||
| ToolACE (Subset) | Average Accuracy | 82.71 | 83.24 | +0.53 |
| ToolACE (Subset) | Average Accuracy | 83.17 | 83.24 | +0.07 |
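
The Δ column appears to be the plain difference between the "This Paper" and "Baseline" scores. As a minimal sketch for checking that arithmetic, the snippet below recomputes each difference from the values in the table. The names `ROWS`, `delta`, `format_delta`, and `DELTAS` are illustrative and not part of the original work, and `Decimal` is used only to avoid floating-point rounding noise at two decimal places.

```python
from decimal import Decimal

# (benchmark, baseline, this_paper) copied from the table, in row order.
# Values are kept as strings so Decimal parses them exactly.
ROWS = [
    ("BFCL (Overall)", "83.97", "84.82"),
    ("BFCL (Overall)", "83.97", "85.97"),
    ("BFCL (Overall)", "81.88", "84.82"),
    ("API-Bank", "77.16", "82.19"),
    ("ACEBench", "82.34", "87.00"),
    ("ToolACE (Subset)", "82.71", "83.24"),
    ("ToolACE (Subset)", "83.17", "83.24"),
]


def delta(baseline: str, this_paper: str) -> Decimal:
    """Return this_paper minus baseline as an exact decimal."""
    return Decimal(this_paper) - Decimal(baseline)


def format_delta(value: Decimal) -> str:
    """Format a difference with an explicit sign and two decimals."""
    return f"{value:+.2f}"


# Signed differences, one per table row.
DELTAS = [format_delta(delta(base, ours)) for _, base, ours in ROWS]

if __name__ == "__main__":
    for (name, base, ours), d in zip(ROWS, DELTAS):
        print(f"{name}: {ours} - {base} = {d}")
```

Under this reading, every recomputed difference matches the Δ value reported in its row.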