| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Scaling law fit quality demonstrates high predictive accuracy for loss across different model sizes and token budgets. | ||||
| Validation Loss | R2 | N/A | 0.982 | N/A |
| Downstream performance comparison between split models (using optimal allocation) and full pretraining. | ||||
| Average Zero-shot QA (8 tasks) | Accuracy | 55.89 | 58.73 | +2.84 |
| Pile (5 domains) | Perplexity | 3.55 | 3.22 | -0.33 |
| Comparison of 2.7B split model against baselines. | ||||
| Average Zero-shot QA | Accuracy | 63.2 | 63.8 | +0.6 |
| Impact of routing strategy and cluster count on performance. | ||||
| Average Zero-shot QA | Accuracy | 58.73 | 59.25 | +0.52 |