| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| Main comparison on AIME 2024 showing ReTool significantly outperforming baselines with fewer training steps. | ||||
| AIME 2024 | Accuracy | 40.0 | 67.0 | +27.0 |
| AIME 2024 | Accuracy | 26.7 | 67.0 | +40.3 |
| AIME 2024 | Accuracy | 56.7 | 67.0 | +10.3 |
| Results on AIME 2025 showing generalization and performance against proprietary models. | ||||
| AIME 2025 | Accuracy | 36.7 | 49.3 | +12.6 |
| AIME 2025 | Accuracy | 37.9 | 49.3 | +11.4 |
| Cold-start performance showing the effectiveness of the data pipeline. | ||||
| AIME 2024 | Accuracy | 26.7 | 40.9 | +14.2 |
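
The Δ column can be rechecked with a few lines of Python. This is a minimal sketch, assuming the accuracies are percentages and that Δ is the absolute difference (This Paper minus Baseline). The names `rows` and `delta_pp` are illustrative and do not come from the paper.

```python
# Rows copied from the table: (benchmark, baseline accuracy, this paper's accuracy).
rows = [
    ("AIME 2024", 40.0, 67.0),
    ("AIME 2024", 26.7, 67.0),
    ("AIME 2024", 56.7, 67.0),
    ("AIME 2025", 36.7, 49.3),
    ("AIME 2025", 37.9, 49.3),
    ("AIME 2024", 26.7, 40.9),  # cold-start row
]


def delta_pp(baseline: float, ours: float) -> float:
    """Absolute gain (ours minus baseline), rounded to one decimal place."""
    # Assumption: Delta is a plain difference, not a relative improvement.
    return round(ours - baseline, 1)


deltas = [delta_pp(baseline, ours) for _, baseline, ours in rows]

for (name, baseline, ours), d in zip(rows, deltas):
    print(f"{name}: {baseline:.1f} -> {ours:.1f} ({d:+.1f})")
```

Rounding to one decimal keeps floating-point noise from the subtraction out of the comparison with the reported values.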