On AIME25, using Qwen3-4B-Base trained on AIME24 (label-free), Evol-RL improves over the TTRL baseline from 4.6 to 16.4 pass@1 and from 18.5 to 37.9 pass@16.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AIME25 | pass@1 | 4.6 | 16.4 | +11.8 |
| AIME25 | pass@16 | 18.5 | 37.9 | +19.4 |
| AIME24 | pass@16 | Not reported in the paper | Not reported in the paper | +24.2 |
| GPQA-Diamond | pass@16 | Not reported in the paper | Not reported in the paper | +15.0 |
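
The Δ column appears to be the absolute difference between the "This Paper" and "Baseline" scores. As a minimal sketch under that assumption, the snippet below recomputes Δ for the two rows where both scores are given. The names `reported` and `delta` are illustrative and do not come from the paper; the AIME24 and GPQA-Diamond rows are left out because their underlying scores are not reported.

```python
# Reported (baseline, this-paper) scores, copied from the table above.
# Only rows with both values available are included.
reported = {
    ("AIME25", "pass@1"): (4.6, 16.4),
    ("AIME25", "pass@16"): (18.5, 37.9),
}


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement, rounded to one decimal place as in the table.

    Assumption: the table's delta is a plain difference, not a relative gain.
    """
    return round(ours - baseline, 1)


# Recompute the delta column for each reported row.
deltas = {key: delta(*scores) for key, scores in reported.items()}

for (benchmark, metric), d in deltas.items():
    print(f"{benchmark} {metric}: {d:+.1f}")
```

Both recomputed values match the table (+11.8 and +19.4), which supports reading Δ as an absolute rather than a relative gain.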