MoE: Mixture of Experts—a model architecture in which a learned router sends each token to a small subset of 'expert' sub-networks, so only a fraction of the model's parameters are active per token, reducing compute cost per token.
EP: Expert Parallelism—a distribution strategy where different experts are placed on different GPUs, requiring tokens to be routed to the correct GPU.
Parallel Folding: A technique in this paper that decouples attention and MoE layer parallelism configurations, breaking the traditional constraint that Expert Parallelism must equal Data Parallelism.
DeepEP: An optimized communication library/dispatcher for handling the complex all-to-all token routing required in Expert Parallelism.
Grouped GEMM: A matrix-multiplication kernel that batches many GEMM operations of varying shapes into a single launch, essential for the uneven per-expert token counts that MoE routing produces.
Three Walls: The three coupled constraints in MoE training defined by the authors: Memory Wall, Communication Wall, and Compute Efficiency Wall.
TFLOPS: Trillions of Floating Point Operations Per Second—a measure of raw computational throughput.
FP8: 8-bit Floating Point—a reduced precision number format that lowers memory usage and speeds up math compared to 16-bit or 32-bit formats.
NVFP4: NVIDIA 4-bit Floating Point—an even lower precision format supported by newer hardware for extreme efficiency.
All-to-All: A collective communication operation where every GPU sends distinct data to every other GPU; used here for routing tokens to experts.
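The top-k routing behind MoE and EP can be sketched in a few lines. This is a toy illustration, not the paper's implementation: real routers are learned linear layers over token embeddings, and the scores below are made up for reproducibility.

```python
def route_tokens(token_scores, num_experts, top_k=2):
    """Toy top-k router: for each token, pick the top_k experts with the
    highest router scores. Returns, per expert, the list of token indices
    assigned to it. (Illustrative only; real MoE routers are learned.)"""
    assignments = {e: [] for e in range(num_experts)}
    for tok_idx, scores in enumerate(token_scores):
        ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
        for expert in ranked[:top_k]:
            assignments[expert].append(tok_idx)
    return assignments

# 4 tokens, 4 experts, fixed (hypothetical) router scores
scores = [
    [0.9, 0.1, 0.5, 0.2],
    [0.2, 0.8, 0.1, 0.7],
    [0.3, 0.3, 0.9, 0.1],
    [0.6, 0.2, 0.2, 0.8],
]
print(route_tokens(scores, num_experts=4, top_k=2))
```

Under Expert Parallelism, each expert's token list would then live on a different GPU, which is exactly what makes the all-to-all dispatch necessary.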
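The grouped GEMM interface can likewise be sketched with a naive loop. A real kernel (e.g. a fused CUDA grouped-GEMM) launches all the differently sized matmuls at once; this pure-Python version only shows why the shapes differ per expert.

```python
def grouped_gemm(a_list, b_list):
    """Naive 'grouped GEMM': one matmul per (A_i, B_i) pair, where each
    pair may have a different M dimension (tokens routed to expert i).
    Real grouped-GEMM kernels fuse these into a single GPU launch."""
    outputs = []
    for a, b in zip(a_list, b_list):
        m, k = len(a), len(a[0])
        n = len(b[0])
        out = [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(n)]
               for i in range(m)]
        outputs.append(out)
    return outputs

# Two experts with uneven token counts (2 tokens vs 1), hidden dim 2.
a_list = [[[1, 2], [3, 4]], [[5, 6]]]          # per-expert activations
b_list = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]  # per-expert weights
print(grouped_gemm(a_list, b_list))  # [[[1, 2], [3, 4]], [[10, 12]]]
```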
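Finally, the all-to-all pattern itself is easy to simulate on one machine. The sketch below models N ranks as lists; in practice this would be an NCCL or `torch.distributed` all-to-all collective moving routed tokens between GPUs.

```python
def all_to_all(send_buffers):
    """Simulate an all-to-all across N ranks: send_buffers[i][j] is the
    data rank i sends to rank j. Returns recv_buffers, where
    recv_buffers[j][i] is what rank j received from rank i."""
    n = len(send_buffers)
    return [[send_buffers[i][j] for i in range(n)] for j in range(n)]

send = [
    ["a->0", "a->1", "a->2"],  # rank 0's outgoing data, one slot per peer
    ["b->0", "b->1", "b->2"],  # rank 1
    ["c->0", "c->1", "c->2"],  # rank 2
]
recv = all_to_all(send)
print(recv[1])  # rank 1 receives: ['a->1', 'b->1', 'c->1']
```

Note that every rank both sends to and receives from every other rank, which is why this collective dominates the Communication Wall in EP training.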