| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Comparison against S1 (state-of-the-art length-control baseline), showing large gains at lower token budgets. | | | | |
| Math Reasoning (Avg) | Accuracy (Relative Gain) | Not reported in the paper | Not reported in the paper | Not reported in the paper |
| Short Reasoning Model (SRM) comparison: L1 vs. much larger models at equivalent short generation lengths. | | | | |
| Average (MATH, AMC, AIME, etc.) | Accuracy | 48.3 | 50.3 | +2.0 |
| Average (MATH, AMC, AIME, etc.) | Accuracy | 45.0 | 50.3 | +5.3 |
| Length controllability analysis. | | | | |
| Math Datasets (Avg) | Mean Length Error | Not reported in the paper | 0.03 | Not reported in the paper |
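
The Δ column in the two accuracy rows appears to be the plain difference "This Paper minus Baseline". The minimal sketch below recomputes it from the values in the table, assuming the accuracies are percentages and Δ is an absolute difference in percentage points. The `rows` list and the `delta` helper are hypothetical names introduced only for this illustration.

```python
# Recompute the Δ column for the two SRM comparison rows.
# Assumption: accuracies are percentages, Δ is in percentage points.
rows = [
    # (benchmark label, baseline accuracy, this paper's accuracy)
    ("Average (MATH, AMC, AIME, etc.)", 48.3, 50.3),
    ("Average (MATH, AMC, AIME, etc.)", 45.0, 50.3),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute gain, rounded to one decimal to match the table."""
    return round(ours - baseline, 1)


deltas = [delta(baseline, ours) for _, baseline, ours in rows]

for (name, baseline, ours), d in zip(rows, deltas):
    print(f"{name}: {baseline} -> {ours} ({d:+.1f})")
```

Rounding to one decimal avoids floating-point noise in the subtraction and matches the precision used in the table.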