
Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Junzhuo Li, Peijie Jiang, Changxin Tian, Jia Liu, Zhiqiang Zhang, Xuming Hu
The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology, Ant Group
arXiv (2026)
Pretraining

📝 Paper Summary

Neural Scaling Laws · Mixture-of-Experts (MoE) · Architecture Design
The optimal ratio of compute allocated to expert versus attention layers in MoE models is not fixed but scales predictably with total compute budget and sparsity.
Core Problem
Current Mixture-of-Experts (MoE) designs often inherit the attention-to-feedforward compute ratio from dense Transformers or tune it heuristically, ignoring how the optimal allocation shifts with scale.
Why it matters:
  • Misallocating compute between experts and attention leads to measurable performance loss under fixed training budgets
  • Existing scaling laws (e.g., Chinchilla) assume fixed internal architectures and do not guide the expert-attention trade-off
  • As models grow, expert layers dominate the compute budget; optimizing this allocation is critical for efficiency
Concrete Example: A highly sparse MoE model trained with a fixed, dense-style compute allocation might overspend resources on expert layers that yield diminishing returns, whereas shifting that compute to attention would lower loss.
Key Novelty
Scale-and-Sparsity-Dependent Allocation Law
  • Empirically determines that the optimal FLOPs ratio (experts vs. attention) follows a power law with respect to total compute
  • Demonstrates that sparsity modulates this relationship: lower-sparsity models demand steeper increases in expert compute as they scale, while higher sparsity flattens this demand
  • Incorporates this ratio into a unified scaling-law equation that predicts loss from compute, sparsity, and internal allocation (see the sketch after this list)
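A minimal sketch of the kind of allocation law described above, assuming a hypothetical functional form r*(C, S) = a(S) · C^b(S) in which both coefficients depend on the fraction of activated experts. The function name, parameterization, and coefficient values are illustrative placeholders, not the paper's fitted equation.

```python
import numpy as np

def optimal_expert_attention_ratio(compute_flops, sparsity,
                                    a0=0.5, a1=1.0, b0=0.05, b1=0.10):
    """Hypothetical allocation law: r*(C, S) = a(S) * C**b(S).

    compute_flops : total training compute budget C (FLOPs)
    sparsity      : S, the fraction of experts NOT activated per token
    a0, a1, b0, b1: illustrative coefficients; the paper fits such
                    coefficients empirically, these are placeholders.
    """
    active_fraction = 1.0 - sparsity      # fraction of activated experts, 1 - S
    a = a0 + a1 * active_fraction         # prefactor shifts with sparsity
    b = b0 + b1 * active_fraction         # exponent is steeper at lower sparsity
    return a * compute_flops ** b         # expert-to-attention FLOPs ratio

# The ratio grows as a power law in compute, modulated by sparsity.
for C in [1e19, 1e20, 1e21]:
    print(f"C={C:.0e}  S=0.90 -> r*={optimal_expert_attention_ratio(C, 0.90):.2f}  "
          f"S=0.98 -> r*={optimal_expert_attention_ratio(C, 0.98):.2f}")
```

With this toy parameterization, the exponent b(S) increases with the activated fraction, so lower-sparsity models see a steeper rise in optimal expert compute as the budget grows, matching the qualitative claim above.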
Evaluation Highlights
  • The proposed scaling law accurately predicts training loss on held-out sparsity levels not seen during fitting (see the fitting sketch after this list)
  • Empirical results show the optimal expert-attention ratio increases monotonically with total compute (power-law behavior)
  • Coefficients of the scaling law depend systematically on the fraction of activated experts, 1 − S
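As a rough illustration of how such a law could be fitted per sparsity level and then checked on a held-out level, here is a sketch using synthetic measurements. The data, the linear interpolation of coefficients in 1 − S, and the log-log fit are assumptions for illustration, not the paper's actual procedure or numbers.

```python
import numpy as np

# Synthetic stand-in for measured optimal ratios r* at several compute budgets
# and sparsity levels (the paper fits real training-run measurements instead).
compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])
measurements = {                      # sparsity S -> observed optimal ratios
    0.90: np.array([3.1, 3.4, 3.8, 4.1, 4.6]),
    0.94: np.array([2.9, 3.1, 3.4, 3.6, 3.9]),
    0.98: np.array([2.7, 2.8, 3.0, 3.1, 3.3]),
}

# Fit log r* = log a(S) + b(S) * log C separately for each sparsity level.
coeffs = {}
for S, r in measurements.items():
    b, log_a = np.polyfit(np.log(compute), np.log(r), deg=1)
    coeffs[S] = (np.exp(log_a), b)

# Interpolate a(S) and b(S) in the activated fraction (1 - S) to predict the
# law at a held-out sparsity level, which could then be compared against
# fresh training runs at that sparsity.
S_held_out = 0.96
active = np.array([1 - S for S in coeffs])
a_fit = np.array([coeffs[S][0] for S in coeffs])
b_fit = np.array([coeffs[S][1] for S in coeffs])
order = np.argsort(active)
a_pred = np.interp(1 - S_held_out, active[order], a_fit[order])
b_pred = np.interp(1 - S_held_out, active[order], b_fit[order])
print(f"Predicted law at S={S_held_out}: r*(C) = {a_pred:.2f} * C^{b_pred:.3f}")
```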
Breakthrough Assessment
7/10
Provides a significant refinement to MoE scaling laws by treating internal compute allocation as a dynamic variable. Offers practical design guidelines, though restricted to fixed sparsity regimes.