| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| T1 (Qwen2.5-32B) outperforms strong baselines including QwQ and proprietary models on math benchmarks. | | | | |
| MATH500 | Accuracy | 90.6 | 92.4 | +1.8 |
| AIME 2024 | Accuracy | 24.9 | 50.6 | +25.7 |
| Omni-MATH-500 | Accuracy | 46.6 | 49.6 | +3.0 |
| GPQA | Accuracy | 49.5 | 56.1 | +6.6 |
| Ablation studies show the impact of sampling diversity (K) during training. | | | | |
| MATH500 | Accuracy | 83.0 | 86.0 | +3.0 |