STE substantially improves tool-use accuracy across different base models compared to baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ToolBench (Subset) | Correctness | 30.1 | 76.8 | +46.7 |
| ToolBench (Subset) | Correctness | 60.8 | 76.8 | +16.0 |
| ToolBench (Subset) | Correctness | 37.3 | 76.8 | +39.5 |
| ToolBench (Subset) | Correctness | 36.6 | 64.9 | +28.3 |