โ† Back to Paper List

Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun
arXiv (2026)
Pretraining Reasoning Benchmark

๐Ÿ“ Paper Summary

Mixture of Experts (MoE) · Efficient Transformers · Large Language Model Pretraining
Expert Threshold routing replaces batch-dependent top-k selection with a stable global threshold (estimated via exponential moving average), enabling fully causal, dynamic routing that maintains load balance without auxiliary losses.
Core Problem
Existing MoE routing methods either enforce fixed sparsity (Token Choice), which limits dynamic compute and requires complex auxiliary losses for load balancing, or violate causality (Expert Choice), making them unsuitable for autoregressive generation.
Why it matters:
  • Token Choice (TC) creates a combinatorial optimization problem where load balancing conflicts with selecting the best experts, often requiring heuristics like auxiliary losses.
  • Expert Choice (EC) achieves perfect load balance but requires looking at all tokens in a batch to select the top-k, which is impossible during autoregressive inference where future tokens are unknown.
  • The mismatch between training (batch-level EC) and inference (causal) leads to performance degradation when batch sizes are small.
Concrete Example: In Expert Choice routing, to decide if the 5th token in a sequence goes to Expert A, the model must compare its score against all other tokens in the batch (including future tokens 6-100). During inference, tokens 6-100 don't exist yet, so the routing decision cannot be made identically to training.
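The batch dependence described above can be sketched numerically. This toy example (illustrative scores, not from the paper) shows a token that loses the batch-wide top-k competition during training but would be selected under a causal view that only sees past tokens:

```python
import numpy as np

# Toy router scores for one expert over an 8-token batch
# (illustrative values, not from the paper).
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.5, 0.95, 0.7, 0.85])
k = 3  # the expert accepts its top-3 tokens

# Batch-level Expert Choice (training): token i is routed iff its score
# is among the top-k of the WHOLE batch, future tokens included.
train_selected = set(np.argsort(scores)[-k:].tolist())

# Causal view (inference): when deciding token i, only tokens 0..i exist.
i = 4
causal_selected = set(np.argsort(scores[: i + 1])[-k:].tolist())

print(i in train_selected)   # False: 0.5 loses to the later scores 0.95 and 0.85
print(i in causal_selected)  # True: among the first five tokens it ranks 3rd
```

The same token gets opposite routing decisions depending on whether future tokens are visible, which is exactly the training/inference inconsistency ET routing removes.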
Key Novelty
Expert Threshold (ET) Routing
  • Routes tokens based on whether their score exceeds a learned threshold, rather than competing against other tokens in the current batch.
  • Maintains the threshold as an Exponential Moving Average (EMA) of the k-th highest score from previous steps, approximating the global population distribution.
  • Decouples routing decisions for each token, making the process fully causal and consistent between training and inference while ensuring asymptotic load balancing.
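The mechanism above can be sketched as a small router that tracks, per expert, an EMA of the k-th highest score in each training batch and routes any token whose score exceeds that threshold. This is a minimal illustration under those assumptions; the class and method names are hypothetical, not the paper's code:

```python
import numpy as np

class ExpertThresholdRouter:
    """Sketch of Expert Threshold (ET) routing: each expert keeps an EMA of
    the k-th highest router score seen in past batches, and a token is routed
    to an expert iff its score exceeds that threshold -- no batch-wide top-k,
    so the per-token decision is causal and identical at inference."""

    def __init__(self, num_experts, k, decay=0.99):
        self.k = k
        self.decay = decay
        self.thresholds = np.zeros(num_experts)  # one running threshold per expert

    def update(self, scores):
        # scores: (num_tokens, num_experts) router scores for a training batch.
        # Take the k-th highest score per expert in this batch...
        kth = np.sort(scores, axis=0)[-self.k]
        # ...and fold it into the running EMA threshold.
        self.thresholds = self.decay * self.thresholds + (1 - self.decay) * kth
        # Routing mask: each token is compared only against the threshold,
        # never against other tokens in the batch.
        return scores > self.thresholds

    def route(self, token_scores):
        # Inference: a single token's routing needs no other tokens at all.
        return token_scores > self.thresholds

# Toy usage with a 3-token batch and 4 experts.
router = ExpertThresholdRouter(num_experts=4, k=2)
batch = np.array([[0.9, 0.2, 0.1, 0.4],
                  [0.3, 0.8, 0.6, 0.5],
                  [0.7, 0.1, 0.9, 0.2]])
mask = router.update(batch)
decision = router.route(np.array([0.5, 0.5, 0.5, 0.5]))
```

Because the threshold tracks the population's k-th-highest score rather than the current batch's, the expected number of admitted tokens per expert converges to k, which is how load balance emerges without an auxiliary loss.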
Evaluation Highlights
  • When pretraining a 2.4B-parameter model on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than the Token Choice (TC) baseline.
  • ET achieves CORE benchmark scores of 19.88 (d12 model) and matches large-batch Expert Choice (19.94) without requiring batch coordination.
  • ET matches TC performance while using 1.6x fewer pretraining tokens, owing to faster convergence.
Breakthrough Assessment
8/10
Elegantly solves the causality vs. load-balancing trade-off in MoEs. It matches the theoretical optimality of Expert Choice without its inference limitations, offering a simpler, loss-free alternative to standard Token Choice.