Evaluation Setup
Pre-training language models from scratch at various scales to fit scaling laws, followed by a large-scale verification run.
Benchmarks:
- Pre-training Loss (Language Modeling)
- Downstream Tasks (general-capability suite, implied, used for Ling-mini-beta validation)
Metrics:
- Validation Loss
- Efficiency Leverage (EL): the ratio of dense-model FLOPs to MoE FLOPs required to reach the same loss (iso-loss)
- Statistical methodology: power laws fitted to experimental data from over 300 trained models; a 'near-optimal' filter (keeping configurations with loss within 0.25% of the minimum) was applied to ensure robust fits.
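The fitting procedure above can be sketched as a two-step recipe: filter each compute budget down to its near-optimal configurations, then fit a power law in log-log space. The data values, tolerance handling, and pure power-law form below are illustrative assumptions, not the paper's actual measurements or fitted coefficients.

```python
import numpy as np

# Illustrative sweep: several model configs per compute budget, each with a
# final validation loss (synthetic values, not taken from the paper).
budgets = np.array([1e18, 1e18, 1e19, 1e19, 1e20, 1e20])
losses = np.array([3.00, 3.20, 2.70, 2.74, 2.43, 2.65])

def near_optimal_mask(budgets, losses, tol=0.0025):
    """Step 1: 'near-optimal' filtering -- at each compute budget, keep only
    configurations whose loss is within 0.25% of the best loss observed
    for that budget."""
    mask = np.zeros_like(losses, dtype=bool)
    for c in np.unique(budgets):
        idx = budgets == c
        best = losses[idx].min()
        mask |= idx & (losses <= best * (1 + tol))
    return mask

mask = near_optimal_mask(budgets, losses)
c_fit, l_fit = budgets[mask], losses[mask]

# Step 2: fit a power law L(C) = a * C^(-b) via linear regression in
# log-log space (slope of the log-log line is -b).
slope, log_a = np.polyfit(np.log(c_fit), np.log(l_fit), 1)
a, b = np.exp(log_a), -slope
```

With the synthetic data above, only the best configuration per budget survives the filter, and the recovered exponent `b` is positive (loss falls as compute grows), which is the sanity check one would apply before trusting any fit.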
Key Results
Validation of the derived Efficiency Leverage scaling laws using the Ling-mini-beta model:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Training Loss (1T tokens) | Performance Equivalence | Equivalent Loss | Equivalent Loss | 0 |
| Computational Cost | FLOPs | 100% (normalized) | ~14% (normalized) | -86% |

Scaling law findings regarding architectural parameters:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Efficiency Leverage | Optimal Expert Granularity | Varies | 8 to 12 | — |
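The FLOPs row above implies the Efficiency Leverage directly. A minimal worked calculation, assuming the ~14% normalized cost is exact:

```python
# Efficiency Leverage (EL) implied by the table: the dense baseline needs
# 100% of normalized FLOPs while the MoE needs ~14% for the same loss.
dense_flops = 1.00
moe_flops = 0.14  # approximate, read off the table
el = dense_flops / moe_flops
print(f"EL ≈ {el:.1f}x")  # prints "EL ≈ 7.1x"
```

That is, the -86% FLOPs reduction corresponds to roughly a 7x compute advantage for the MoE at iso-loss.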
Main Takeaways
- Efficiency Leverage (EL) is primarily driven by the expert activation ratio (lower ratio = higher EL) and total compute budget (higher budget = higher EL).
- Expert granularity modulates efficiency non-linearly; both overly fine and overly coarse experts fall short of the optimum (found to be 8 to 12).
- MoE models scale better than dense models with increased compute; the efficiency gap widens as the training budget grows.
- Optimal MoE models should generally be computationally smaller (fewer active parameters) but trained on more data compared to optimal dense models for the same budget.
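The first and third takeaways can be illustrated with a toy functional form in which EL grows as the activation ratio falls and as the compute budget grows. The form and exponents below are invented for illustration only; they are NOT the paper's fitted scaling-law coefficients.

```python
# Hypothetical sketch of the qualitative EL trends described above.
# activation_ratio: fraction of parameters active per token (lower = sparser).
# compute_flops: total training compute budget.
# alpha, beta, ref_flops: made-up constants for illustration.
def efficiency_leverage(activation_ratio, compute_flops,
                        alpha=0.5, beta=0.05, ref_flops=1e18):
    # Lower activation ratio -> larger (1/A)^alpha term (higher EL);
    # larger budget -> larger (C/C0)^beta term (EL gap widens with scale).
    return (1.0 / activation_ratio) ** alpha * (compute_flops / ref_flops) ** beta

el_sparse = efficiency_leverage(0.05, 1e20)  # sparser activation
el_dense = efficiency_leverage(0.25, 1e20)   # denser activation, same budget
```

Under this toy model, `el_sparse > el_dense` at a fixed budget, and EL at any fixed activation ratio increases monotonically with the budget, matching the qualitative claims in the takeaways.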