| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Single-turn performance comparisons showing TL-CodeLLaMA-2 outperforms open-source baselines and rivals GPT-4. | ||||
| ToolAlpaca | CF (Content Filling) | 45.00 | 60.78 | +15.78 |
| BFCL-v3 | CF (Content Filling) | 73.53 | 85.61 | +12.08 |
| RoTBench | CF (Content Filling) | 45.19 | 64.90 | +19.71 |
| Multi-turn performance on the ToolEyes dataset, measuring error rates and valid response rates. | ||||
| ToolEyes (Multi-turn) | Total Error (DE + CE) | 7.49 | 5.64 | -1.85 |
| ToolEyes (Multi-turn) | Total Error (DE + CE) | 11.12 | 8.38 | -2.74 |
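
The Δ column appears to be the "This Paper" value minus the "Baseline" value, so a positive Δ is an improvement for CF and a negative Δ is an improvement for Total Error. The following is a minimal sketch that recomputes Δ from the numbers in the table under that assumption. The `rows` list and the `delta` helper are illustrative names, not part of the original work.

```python
# Values copied from the table: (benchmark, metric, baseline, this_paper, reported_delta).
# Assumption: delta = this_paper - baseline, rounded to two decimals.
rows = [
    ("ToolAlpaca", "CF", 45.00, 60.78, 15.78),
    ("BFCL-v3", "CF", 73.53, 85.61, 12.08),
    ("RoTBench", "CF", 45.19, 64.90, 19.71),
    ("ToolEyes (Multi-turn)", "Total Error (DE + CE)", 7.49, 5.64, -1.85),
    ("ToolEyes (Multi-turn)", "Total Error (DE + CE)", 11.12, 8.38, -2.74),
]


def delta(baseline: float, this_paper: float) -> float:
    """Return the signed difference, rounded to match the table's precision."""
    return round(this_paper - baseline, 2)


recomputed = [delta(baseline, ours) for _, _, baseline, ours, _ in rows]

for (name, metric, _, _, reported), value in zip(rows, recomputed):
    status = "matches" if abs(value - reported) < 1e-9 else "differs from"
    print(f"{name} | {metric}: recomputed {value:+.2f} {status} reported {reported:+.2f}")
```

Under this reading, all five reported Δ values are consistent with the Baseline and This Paper columns.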