MoE: Mixture-of-Experts—a neural network architecture where only a subset of parameters (experts) are used for each input token.
Active Parameters: The number of parameters actually used to process a single token, which is much smaller than the total parameter count in MoE models.
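A toy illustration of the gap between total and active parameters. All sizes below (dimensions, layer count, expert count) are invented for the example, not taken from any real model:

```python
# Hypothetical MoE config; every number here is illustrative only.
d_model, d_ff = 1024, 4096
n_layers, n_experts, top_k = 24, 64, 2

attn_params = 4 * d_model**2        # per-layer attention (Q, K, V, O projections)
expert_params = 2 * d_model * d_ff  # one expert's FFN (up + down projection)

# Total stores every expert; active counts only the top-k experts a token visits.
total = n_layers * (attn_params + n_experts * expert_params)
active = n_layers * (attn_params + top_k * expert_params)

print(f"total:  {total / 1e9:.2f}B params")   # roughly 13B total
print(f"active: {active / 1e9:.2f}B params")  # roughly 0.5B active per token
```

With 64 experts and top-2 routing, a token touches about 4% of the parameters, which is the efficiency MoE trades on.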
IsoFLOP: An experimental method for finding the optimal model size and training-data size under a fixed compute budget (FLOPs), by training models of several sizes at each budget and locating the minimum of the resulting loss curve.
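The procedure can be sketched numerically: fix a FLOP budget, sweep model sizes with token counts chosen to exhaust the budget, and fit a parabola to loss versus log model size. The loss formula below is a made-up stand-in for measured training losses, and the 6*N*D FLOP approximation is the usual rule of thumb:

```python
import numpy as np

C = 1e21                          # fixed FLOP budget for this isoFLOP slice
Ns = np.logspace(8, 10, 9)        # candidate model sizes (parameters)
Ds = C / (6 * Ns)                 # token counts that spend exactly the budget

# Hypothetical scaling-law loss; real curves come from actual training runs.
loss = 2.0 + 400 / Ns**0.3 + 1227 / Ds**0.3

# Fit a parabola in log(N) and read off the compute-optimal size at its vertex.
a, b, c = np.polyfit(np.log(Ns), loss, 2)
N_opt = np.exp(-b / (2 * a))
```

Repeating this for several budgets traces how the optimal model size grows with compute.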
Router z-loss: An auxiliary loss function that penalizes large logits in the router to improve training stability.
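A minimal sketch of the penalty, following the common formulation in which the squared log-partition (logsumexp) of the router logits is averaged over tokens:

```python
import numpy as np

def router_z_loss(logits):
    """logits: [n_tokens, n_experts] raw router outputs."""
    m = logits.max(axis=-1, keepdims=True)
    # logsumexp per token, computed stably by factoring out the max logit
    z = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    # large-magnitude logits inflate z, so its square penalizes them
    return np.mean(z ** 2)
```

Scaling all logits up leaves the router's argmax unchanged but grows this loss, which is why it curbs drifting logit magnitudes without dictating routing decisions.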
Load Balancing Loss: An auxiliary loss ensuring tokens are distributed roughly evenly across experts to prevent some experts from being underutilized.
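A sketch in the style of the Switch Transformer's auxiliary loss, assuming top-1 routing for simplicity: the dot product of per-expert dispatch fractions and mean router probabilities, scaled by the expert count, reaches its minimum of 1 under a perfectly uniform assignment:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index):
    """router_probs: [n_tokens, n_experts] softmax outputs;
    expert_index: [n_tokens] expert chosen per token (top-1)."""
    n_tokens, n_experts = router_probs.shape
    # f_i: fraction of tokens actually dispatched to expert i
    f = np.bincount(expert_index, minlength=n_experts) / n_tokens
    # p_i: mean router probability mass assigned to expert i
    p = router_probs.mean(axis=0)
    return n_experts * float(np.sum(f * p))
```

If routing collapses onto one expert the loss climbs toward n_experts, so minimizing it pushes tokens back toward an even spread.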
Shared Experts: Specific expert modules that are always active for every token, providing a baseline computation path alongside the dynamically routed experts.
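A toy forward pass illustrating the idea: one always-on shared expert plus top-k routed experts. The shapes, the random weights, and the plain linear "experts" are all invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k, n_tokens = 16, 4, 2, 8

shared_w = rng.normal(scale=0.1, size=(d, d))             # shared expert: always active
expert_w = rng.normal(scale=0.1, size=(n_experts, d, d))  # dynamically routed experts
router_w = rng.normal(scale=0.1, size=(d, n_experts))     # router projection

def moe_layer(x):
    logits = x @ router_w
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    topk = np.argsort(-probs, axis=-1)[:, :top_k]  # chosen experts per token
    out = x @ shared_w                             # shared path: every token takes it
    for t in range(x.shape[0]):                    # routed path: only top-k experts run
        for e in topk[t]:
            out[t] += probs[t, e] * (x[t] @ expert_w[e])
    return out

y = moe_layer(rng.normal(size=(n_tokens, d)))
```

Because the shared expert sees every token, it can absorb common features, leaving the routed experts free to specialize.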
Megatron-LM: A high-performance library for training large-scale language models using various forms of parallelism.
DCLM: DataComp-LM—a large-scale open-source pretraining dataset used for training these models.