Scalable Training of Mixture-of-Experts Models with Megatron Core

Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu, Hongbin Liu, Pingtian Li, Evan Wu, Shiqing Fan, Li Tao, Robin Zhang, Yuzhong Wang, Shifang Xu, Jack Chang, Xuwen Chen, Kunlun Li, Yan Bai, Gao Deng, Nan Zheng, Vijay Anand Korthikanti, Abhinav Khattar, Ethan He, Soham Govande, Sangkug Lym, Zhongbo Zhu, Qi Zhang, Haochen Yuan, Xiaowei Ren, Deyu Fu, Tailai Ma, et al.
NVIDIA Corporation
arXiv (2026)
Pretraining · Memory · RL

📝 Paper Summary

Distributed Training Systems · Mixture of Experts (MoE)
Megatron-Core MoE addresses the specific memory, communication, and compute bottlenecks of sparse models through co-designed optimizations like Parallel Folding and DeepEP, enabling efficient training of trillion-parameter architectures.
Core Problem
Training MoE models creates a 'Parameter-Compute Mismatch': total parameters grow much faster than active computation, producing memory pressure, fragmented compute, and communication bottlenecks that standard dense-model frameworks cannot handle.
Why it matters:
  • Standard parallelism strategies assume parameters and compute scale together; MoE's sparsity breaks this assumption, making naive sharding inefficient
  • The 'Memory Wall': Storing all expert parameters while only activating a few per token creates memory pressure far beyond what a dense model with the same active compute requires
  • The 'Communication Wall': Expert Parallelism requires massive all-to-all token routing that can consume up to 60% of training time if left unoptimized (a minimal dispatch sketch follows this list)
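To make the communication wall concrete, here is a minimal sketch of the all-to-all token dispatch that Expert Parallelism requires, written in plain PyTorch. It is not the Megatron-Core or DeepEP implementation: the function name `dispatch_tokens`, the shapes, and the top-1 routing are illustrative assumptions, and the goal is only to show the traffic pattern, not performance.

```python
# Minimal sketch of expert-parallel token dispatch (illustrative names; top-1
# routing assumed for brevity). Not the Megatron-Core/DeepEP implementation.
# Run with: torchrun --nproc_per_node=2 moe_dispatch_sketch.py  (needs >= 2 GPUs)
import torch
import torch.distributed as dist


def dispatch_tokens(tokens, expert_ids, num_experts, ep_group):
    """Send every token to the expert-parallel rank that hosts its expert."""
    ep_size = dist.get_world_size(ep_group)
    experts_per_rank = num_experts // ep_size
    dest_rank = expert_ids // experts_per_rank            # owning rank per token

    # Sort tokens so each destination rank gets one contiguous slice.
    order = torch.argsort(dest_rank)
    tokens_sorted = tokens[order]
    send_counts = torch.bincount(dest_rank, minlength=ep_size)

    # First exchange the per-rank token counts ...
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts, group=ep_group)

    # ... then the hidden states themselves: this is the bandwidth-heavy step
    # that DeepEP-style dispatchers and communication overlap target.
    recv_tokens = tokens_sorted.new_empty(int(recv_counts.sum()), tokens.size(1))
    dist.all_to_all_single(
        recv_tokens,
        tokens_sorted,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
        group=ep_group,
    )
    return recv_tokens


if __name__ == "__main__":
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    hidden, num_experts, local_tokens = 16, 4 * dist.get_world_size(), 8
    torch.manual_seed(rank)
    tokens = torch.randn(local_tokens, hidden, device="cuda")
    expert_ids = torch.randint(0, num_experts, (local_tokens,), device="cuda")

    received = dispatch_tokens(tokens, expert_ids, num_experts, dist.group.WORLD)
    print(f"rank {rank}: received {received.size(0)} tokens for its local experts")
    dist.destroy_process_group()
```

A production dispatcher additionally handles top-k routing probabilities, fuses the permutation, and overlaps this exchange with computation; the sketch only shows why the exchanged volume scales with hidden size and the number of routed tokens.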
Concrete Example: DeepSeek-V3 has 685B total parameters but only 37B active per token (roughly an 18× gap). Naive sharding fragments the experts into tiny matrix multiplications that underutilize GPUs, while the necessary token routing floods inter-node interconnects; a back-of-envelope footprint estimate follows below.
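As a rough illustration of the memory wall these numbers imply, the snippet below estimates the parameter-plus-optimizer-state footprint. The 16-bytes-per-parameter accounting (BF16 weights and gradients, FP32 master weights, FP32 Adam moments) is a common mixed-precision assumption, not a figure from the paper, and activation memory and sharding are ignored.

```python
# Back-of-envelope "Memory Wall" estimate for DeepSeek-V3-685B.
# Assumption (not from the paper): BF16 weights + BF16 grads + FP32 master
# weights + FP32 Adam moments ~= 16 bytes per parameter; activations ignored.
TOTAL_PARAMS = 685e9     # every expert must be stored somewhere
ACTIVE_PARAMS = 37e9     # parameters actually used per token
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

total_tb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e12
active_tb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e12

print(f"all parameters + optimizer state : ~{total_tb:.1f} TB")   # ~11.0 TB
print(f"active path only, same accounting: ~{active_tb:.1f} TB")  # ~0.6 TB
print(f"parameter-compute mismatch       : ~{TOTAL_PARAMS / ACTIVE_PARAMS:.1f}x")
```

The gap between the two figures is what forces MoE-specific sharding, offloading, and fine-grained recomputation rather than dense-model memory strategies.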
Key Novelty
Integrated System Co-design for the MoE 'Three Walls'
  • Parallel Folding: A technique that decouples the parallelism configurations of attention layers from MoE layers, allowing optimal but conflicting layouts (e.g., Expert Parallelism vs. Data Parallelism) to coexist.
  • DeepEP/HybridEP: Specialized communication dispatchers that maximize bandwidth usage during the all-to-all token routing phase, specifically designed for the sparse, high-volume nature of expert routing.
  • Three-Wall Optimization: Simultaneously tackling memory (fine-grained recomputation), communication (overlap), and compute (Grouped GEMM) so that fixing one bottleneck does not simply shift pressure to another; a Grouped GEMM sketch follows this list.
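To illustrate the compute-wall item, the sketch below contrasts a naive per-expert loop of small matmuls with a single batched call. It is only a conceptual stand-in: real Grouped GEMM kernels handle variable token counts per expert without padding, whereas this torch.bmm version assumes capacity-style routing with equal token counts per expert, and the function names `expert_ffn_loop`/`expert_ffn_grouped` are illustrative.

```python
# Conceptual sketch of why Grouped GEMM helps: many tiny per-expert GEMMs vs.
# one batched call. Assumes equal (capacity-padded) token counts per expert.
import torch
import torch.nn.functional as F


def expert_ffn_loop(tokens, w1, w2):
    """Naive path: one small GEMM pair per expert -> poor GPU utilization."""
    outputs = []
    for e in range(tokens.size(0)):                        # tokens[e]: [capacity, hidden]
        h = F.gelu(tokens[e] @ w1[e])                      # [capacity, ffn]
        outputs.append(h @ w2[e])                          # [capacity, hidden]
    return torch.stack(outputs)


def expert_ffn_grouped(tokens, w1, w2):
    """Grouped path: all experts processed in two batched GEMMs."""
    h = F.gelu(torch.bmm(tokens, w1))                      # [E, capacity, ffn]
    return torch.bmm(h, w2)                                # [E, capacity, hidden]


if __name__ == "__main__":
    E, capacity, hidden, ffn = 8, 64, 256, 1024
    tokens = torch.randn(E, capacity, hidden)
    w1 = torch.randn(E, hidden, ffn) * 0.02
    w2 = torch.randn(E, ffn, hidden) * 0.02

    ref = expert_ffn_loop(tokens, w1, w2)
    out = expert_ffn_grouped(tokens, w1, w2)
    print("max diff:", (ref - out).abs().max().item())     # should be ~0
```

Launching one batched kernel keeps the GPU's math pipelines busy, whereas the per-expert loop issues many small GEMMs whose launch overhead and low occupancy dominate; this is the fragmentation effect the concrete example above refers to.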
Evaluation Highlights
  • Achieves 1,233 TFLOPS/GPU when training DeepSeek-V3-685B on NVIDIA GB300 GPUs
  • Maintains 1,048 TFLOPS/GPU for DeepSeek-V3-685B on NVIDIA GB200 GPUs
  • Achieves 974 TFLOPS/GPU for Qwen3-235B on NVIDIA GB300 GPUs
Breakthrough Assessment
9/10
Provides a comprehensive, production-grade solution to the fundamental systems challenges of MoE training (the 'Three Walls'), backed by state-of-the-art performance numbers on next-gen hardware.