Mixture-of-Experts (MoE): A neural network architecture in which different sub-networks (experts) are activated for different inputs, so the model can have a very large total parameter count while only a small fraction of the parameters is used per token, keeping compute low.
GShard: Google's MoE Transformer architecture, commonly used as a baseline, which routes each token to the Top-K (typically Top-2) of N experts via a learned gating function.
Fine-Grained Expert Segmentation: DeepSeek's method of splitting each FFN expert into 'm' smaller experts (shrinking each expert's hidden dimension accordingly) and activating 'm' times as many experts, keeping compute constant while greatly increasing the number of possible expert combinations the router can select.
Shared Expert Isolation: Designating specific experts that process every token, intended to capture common knowledge shared across contexts so that the routed experts are free to specialize.
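The two DeepSeek-style mechanisms above can be combined in one forward pass: shared experts run on every token, while a gate picks the Top-K of the routed (fine-grained) experts. A minimal pure-Python sketch follows; the function name `moe_forward` is illustrative, and scalar functions stand in for the FFN experts.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, routed_experts, shared_experts, router_scores, k):
    """Combine shared experts (always active) with Top-K routed experts."""
    probs = softmax(router_scores)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    out = 0.0
    for e in shared_experts:        # shared experts process every token
        out += e(token)
    for i in topk:                  # routed experts: gated and weighted
        out += probs[i] * routed_experts[i](token)
    return out
```

With fine-grained segmentation, `routed_experts` would simply contain m*N smaller experts and `k` would be m times larger; the structure of the forward pass is unchanged.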
Routing Collapse: A failure mode in MoE training where the gating network repeatedly selects the same few experts, leaving the others undertrained.
Load Balancing Loss: An auxiliary loss function added to training to ensure experts receive a roughly equal number of tokens, preventing routing collapse.
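One common form of this auxiliary loss (the Switch-Transformer-style formulation, shown here as an assumed example rather than DeepSeek's exact variant) multiplies, per expert, the fraction of tokens actually dispatched to it by its mean router probability:

```python
def load_balancing_loss(router_probs, expert_assignments, num_experts, alpha=0.01):
    """Auxiliary loss alpha * N * sum_i f_i * P_i.

    router_probs: per-token softmax probabilities over the experts
    expert_assignments: index of the expert each token was routed to (top-1)
    """
    T = len(router_probs)
    f = [0.0] * num_experts   # f_i: fraction of tokens dispatched to expert i
    P = [0.0] * num_experts   # P_i: mean router probability for expert i
    for probs, a in zip(router_probs, expert_assignments):
        f[a] += 1.0 / T
        for i, p in enumerate(probs):
            P[i] += p / T
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))
```

The loss is minimized (value alpha) when both the dispatch fractions and the router probabilities are uniform, and grows when routing concentrates on a few experts, which is exactly the collapse it is meant to prevent.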
Top-K Routing: A strategy where the K experts with the highest router scores are selected to process a token.
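Top-K selection itself is a small operation: take the K highest router scores and (in many implementations) renormalize their softmax weights so the selected experts' weights sum to one. A minimal sketch, with `top_k_route` as an illustrative name:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(scores, k):
    """Return {expert_index: weight} for the k highest-scoring experts,
    with the softmax weights renormalized over the selected set."""
    probs = softmax(scores)
    topk = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    z = sum(probs[i] for i in topk)
    return {i: probs[i] / z for i in topk}
```

Whether to renormalize over the selected K is a design choice; some systems use the raw softmax weights instead.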
Knowledge Hybridity: The problem where a single expert is forced to learn diverse, unrelated types of knowledge because the routing is too coarse.
SFT: Supervised Fine-Tuning—training a pre-trained model on labeled instruction-response pairs.