Performance of Frugal-Thinking models compared to baselines on key math benchmarks. Note: baselines such as QwQ-32B are significantly larger or specialized.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AIME25 | Pass@1 | 50.0 | 70.0 | +20.0 |
| MATH-500 | Pass@1 | 90.6 | 92.2 | +1.6 |
| AIME25 | Pass@1 | 19.3 | 53.3 | +34.0 |
| AIME25 | Avg Output Length (Tokens) | 7665 | 15462 | +7797 |