| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Tool-Star outperforms both the base models and strong baselines on mathematical reasoning tasks. | | | | |
| MATH500 | Pass@1 | 60.6 | 65.4 | +4.8 |
| AIME24 | Pass@1 | 2.8 | 15.5 | +12.7 |
| Tool-Star also improves over the baseline on knowledge-intensive QA tasks that require search. | | | | |
| HotpotQA | Pass@1 | 48.6 | 56.4 | +7.8 |
| MATH500 | Pass@1 | 58.2 | 65.4 | +7.2 |