| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| **TUMIX demonstrates significant scaling gains over the base model without test-time scaling.** | | | | |
| HLE, GPQA, AIME (Average) | Accuracy | Not reported as a single aggregate number | Not reported as a single aggregate number | +7.8% |
| HLE, GPQA, AIME (Average) | Accuracy | Not reported as a single aggregate number | Not reported as a single aggregate number | +17.4% |
| **TUMIX outperforms state-of-the-art test-time scaling baselines under equal compute budgets.** | | | | |
| HLE, GPQA, AIME (Average) | Accuracy | Not reported as a single aggregate number | Not reported as a single aggregate number | +3.55% |
| **Deep scaling on HLE shows that TUMIX surpasses strong baselines, including 'Deep Research' variants.** | | | | |
| HLE | Accuracy | 21.6% | 34.1% | +12.5% |
| HLE | Accuracy | Not explicitly reported as a single number | Not explicitly reported as a single number | +1.2% |
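
For the one row that reports both endpoints (HLE, 21.6% to 34.1%), the Δ column matches an absolute difference in accuracy, that is, percentage points rather than a relative improvement. The short sketch below checks that arithmetic. The function name `accuracy_delta` is hypothetical. Whether the other rows use the same absolute convention is an assumption, since their baseline and result values are not given here.

```python
from typing import Tuple


def accuracy_delta(baseline_pct: float, new_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent).

    Hypothetical helper for illustration; inputs are accuracies in percent.
    """
    absolute_pp = new_pct - baseline_pct
    relative_pct = 100.0 * absolute_pp / baseline_pct
    # Round to one decimal place to match the table's precision.
    return round(absolute_pp, 1), round(relative_pct, 1)


# HLE row from the table: baseline 21.6%, reported result 34.1%.
abs_pp, rel_pct = accuracy_delta(21.6, 34.1)
print(f"absolute gain: {abs_pp} pp, relative gain: {rel_pct}%")
```

The absolute gain reproduces the +12.5 shown in the Δ column, while the relative gain over the baseline would be considerably larger, so the two readings should not be confused when comparing rows.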