Main comparisons showing ReTool's superiority over text-based RL baselines and competitive models on AIME benchmarks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AIME 2024 | Accuracy (%) | 40.0 | 67.0 | +27.0 |
| AIME 2025 | Accuracy (%) | 36.7 | 49.3 | +12.6 |
| AIME 2024 | Accuracy (%) | 56.7 | 67.0 | +10.3 |
| AIME 2024 | Training Steps | 1080 | 400 | -680 |