MoE: Mixture-of-Experts—a neural network architecture where different subsets of the network (experts) are activated for different inputs to save computation
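To make the routing idea concrete, here is a toy sketch of top-k expert routing (all names, shapes, and the single-matrix "experts" are hypothetical simplifications; real MoE layers use MLP experts and fused kernels):

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, k=2):
    """Toy MoE layer: route each token to its top-k experts and combine
    their outputs with softmax gate weights. Only the selected experts
    would actually run in a real implementation."""
    scores = x @ gate_w                          # (tokens, n_experts)
    topk = np.argsort(scores, axis=-1)[:, -k:]   # indices of top-k experts
    sel = np.take_along_axis(scores, topk, axis=-1)
    gates = np.exp(sel - sel.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)        # softmax over selected only
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                  # per-token dispatch
        for j in range(k):
            e = topk[t, j]
            out[t] += gates[t, j] * (x[t] @ expert_ws[e])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                  # 4 tokens, hidden dim 8
gate_w = rng.standard_normal((8, 6))             # router over 6 experts
experts = [rng.standard_normal((8, 8)) for _ in range(6)]
y = moe_forward(x, gate_w, experts)
print(y.shape)  # (4, 8)
```

The compute saving comes from the loop body: each token touches only k of the 6 expert weight matrices.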
MLA: Multi-head Latent Attention—an attention mechanism that compresses keys and values jointly into a low-dimensional latent vector, shrinking the memory needed for the KV cache during inference
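A toy sketch of the compression idea: cache only a small latent per token and reconstruct keys and values from it at attention time (single head, hypothetical dimensions; real MLA also treats the RoPE dimensions separately):

```python
import numpy as np

d, d_latent, n_tokens = 16, 4, 5
rng = np.random.default_rng(1)
W_down = rng.standard_normal((d, d_latent))  # compress hidden -> latent
W_uk = rng.standard_normal((d_latent, d))    # up-project latent -> keys
W_uv = rng.standard_normal((d_latent, d))    # up-project latent -> values

h = rng.standard_normal((n_tokens, d))       # hidden states of past tokens
latent_cache = h @ W_down    # only this (n_tokens, d_latent) array is cached
K = latent_cache @ W_uk      # reconstructed on the fly during attention
V = latent_cache @ W_uv
print(latent_cache.size, "cached floats instead of", K.size + V.size)
```

Here the cache holds 20 floats instead of 160: the saving grows with the number of heads and the sequence length.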
MTP: Multi-Token Prediction—a training objective where the model predicts not just the next token, but several future tokens sequentially to improve representation learning
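The extra training targets can be illustrated with a toy token sequence: at depth k, each position is paired with the token k steps ahead (a data-alignment sketch only; the actual objective uses sequential prediction modules and cross-entropy losses):

```python
tokens = [5, 9, 2, 7, 3, 1]   # toy token ids
depth = 2                     # predict 1 and 2 tokens ahead
pairs = {}
for k in range(1, depth + 1):
    # input position i is trained to predict the token at position i + k
    pairs[k] = list(zip(tokens[:len(tokens) - k], tokens[k:]))
print(pairs[1][0])  # (5, 9): ordinary next-token target
print(pairs[2][0])  # (5, 2): additional target two steps ahead
```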
DeepSeekMoE: A specific MoE architecture using fine-grained experts (splitting one expert into many smaller ones) and shared experts (always active) to improve specialization
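Why fine-grained experts help specialization can be seen with a quick count: splitting each expert 4x (with a 4x larger top-k) keeps the activated parameter count roughly constant while vastly increasing the number of possible expert combinations (the specific numbers here are illustrative, not DeepSeekMoE's actual configuration):

```python
from math import comb

# Coarse: 8 experts, route to top-2. Fine-grained: 32 experts, top-8,
# each expert 1/4 the size -> same activated compute per token.
coarse = comb(8, 2)
fine = comb(32, 8)
print(coarse, fine)  # 28 10518300
```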
Auxiliary-loss-free load balancing: A strategy that encourages a balanced load across experts by adjusting a per-expert bias term used during routing, rather than adding a balancing penalty term to the training loss
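A toy simulation of the mechanism (the update rule and constants are illustrative): the bias is added only when selecting the top-k experts, and is nudged down for overloaded experts and up for underloaded ones, gradually cancelling a router that systematically favours one expert:

```python
import numpy as np

def route_with_bias(scores, bias, k=1):
    # bias influences which experts are selected, but (in the real method)
    # not the gate weights used to combine their outputs
    return np.argsort(scores + bias, axis=-1)[:, -k:]

def update_bias(bias, topk, n_experts, gamma=0.01):
    # decrease bias of overloaded experts, increase underloaded ones
    counts = np.bincount(topk.ravel(), minlength=n_experts)
    target = topk.size / n_experts
    return bias - gamma * np.sign(counts - target)

rng = np.random.default_rng(2)
n_experts = 4
bias = np.zeros(n_experts)
skew = np.array([2.0, 0.0, 0.0, 0.0])  # router systematically favours expert 0
for _ in range(500):
    scores = rng.standard_normal((32, n_experts)) + skew
    topk = route_with_bias(scores, bias)
    bias = update_bias(bias, topk, n_experts)
print(bias)  # expert 0's bias has gone negative to counter the skew
```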
FP8: 8-bit Floating Point—a low-precision number format used to accelerate training and reduce memory footprint
DualPipe: A pipeline parallelism schedule that overlaps forward/backward computation with communication to reduce idle time (bubbles) in distributed training
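To see why bubbles matter, here is the idle fraction of a naive fill-and-drain pipeline schedule, the classic (p-1)/(m+p-1) formula for p stages and m microbatches (this quantifies the problem that overlap schedules like DualPipe attack; it is not DualPipe's own schedule):

```python
def bubble_fraction(p, m):
    """Idle fraction of a simple pipeline: (p - 1) fill slots and
    (p - 1) drain slots spread over (m + p - 1) total time slots
    per stage, per pass."""
    return (p - 1) / (m + p - 1)

for p, m in [(4, 4), (4, 16), (8, 64)]:
    print(p, m, round(bubble_fraction(p, m), 3))
```

More microbatches shrink the bubble, but overlapping computation with communication attacks the remaining idle time directly.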
RoPE: Rotary Positional Embedding—a method for encoding position information in Transformer models by rotating the query and key vectors
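A minimal single-vector RoPE sketch, showing its key property: the dot product of a rotated query and key depends only on their relative offset (base frequency 10000 as in the original formulation; the pairing of dimensions varies across implementations):

```python
import numpy as np

def rope(x, pos):
    """Rotate pairs of dimensions of x by position-dependent angles.
    x: (d,) with even d; pos: integer position."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (np.arange(half) / half))
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:half], x[half:]        # pair dim i with dim i + half
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

q = np.array([1.0, 0.5, -0.3, 0.8])
k = np.array([0.2, -1.0, 0.7, 0.1])
a = rope(q, 3) @ rope(k, 1)   # positions 3 and 1: offset 2
b = rope(q, 7) @ rope(k, 5)   # positions 7 and 5: offset 2
print(np.isclose(a, b))       # True: attention score depends only on offset
```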
SFT: Supervised Fine-Tuning—training the model on high-quality instruction-response pairs
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer