MoE: Mixture-of-Experts—a model architecture where different parts of the network ('experts') are activated for different inputs to save compute
KV cache: Key-Value cache—caching the attention keys and values computed for previous tokens during autoregressive generation so they are not recomputed at every decoding step
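A minimal single-head sketch of the idea, in NumPy with illustrative names and sizes (not the paper's): each new token's key/value projections are computed once and appended to a cache, and only the new query attends over the cached entries.

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

rng = np.random.default_rng(0)
d = 8
W_k = rng.normal(size=(d, d))  # key projection (hypothetical weights)
W_v = rng.normal(size=(d, d))  # value projection

xs, cache_K, cache_V, outputs = [], [], [], []
for step in range(4):
    x = rng.normal(size=d)       # hidden state of the newest token
    xs.append(x)
    cache_K.append(x @ W_k)      # project once, then cache
    cache_V.append(x @ W_v)
    # Only the new token's query attends over all cached keys/values:
    outputs.append(attend(x, np.array(cache_K), np.array(cache_V)))
```

The cache trades memory for compute: per-token KV memory grows linearly with sequence length, which is exactly the cost MLA targets.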
MLA: Multi-head Latent Attention—the paper's proposed attention mechanism that compresses keys and values into a low-rank latent vector
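A sketch of the compression step, with made-up dimensions (the weight names and sizes here are illustrative, not the paper's): only a small latent vector is cached per token, and the per-head keys and values are reconstructed from it at attention time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_heads, d_head = 64, 8, 4, 16   # illustrative sizes

W_down = rng.normal(size=(d_model, d_latent))         # shared down-projection
W_uk = rng.normal(size=(d_latent, n_heads * d_head))  # key up-projection
W_uv = rng.normal(size=(d_latent, n_heads * d_head))  # value up-projection

x = rng.normal(size=d_model)   # one token's hidden state
c = x @ W_down                 # only this latent is cached per token
k = (c @ W_uk).reshape(n_heads, d_head)  # keys rebuilt on the fly
v = (c @ W_uv).reshape(n_heads, d_head)  # values rebuilt on the fly

# Cached floats per token: d_latent for MLA vs 2 * n_heads * d_head for MHA.
```

Here the cache shrinks from 2 × 4 × 16 = 128 floats per token to 8. In practice the up-projections can be absorbed into the query and output projections, so the reconstruction adds no extra matrices at inference.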
DeepSeekMoE: A specific MoE architecture using fine-grained expert segmentation and shared expert isolation
RoPE: Rotary Position Embedding—a method to encode positional information into the attention mechanism
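A small NumPy sketch of RoPE's defining property: rotating consecutive pairs of a vector by position-dependent angles makes the query-key dot product depend only on the relative offset between positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of x by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per pair
    ang = pos * freqs
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# Scores match whenever the position offset matches (5-2 == 13-10):
s1 = rope(q, 5) @ rope(k, 2)
s2 = rope(q, 13) @ rope(k, 10)
```

This relative-position property is what MLA's decoupled-RoPE strategy has to preserve while still compressing the keys.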
MHA: Multi-Head Attention—the standard attention mechanism in Transformers
GQA: Grouped-Query Attention—an optimization where query heads are split into groups, each group sharing a single key-value head, shrinking the KV cache
MQA: Multi-Query Attention—an extreme optimization where all query heads share a single key-value head
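The MHA/GQA/MQA spectrum is just a choice of how many key-value heads back the query heads; the per-token KV-cache size follows directly. A quick arithmetic sketch with hypothetical layer sizes:

```python
def kv_cache_per_token(n_kv_heads, d_head):
    """Floats cached per token for one layer: keys + values."""
    return 2 * n_kv_heads * d_head

n_heads, d_head = 32, 128                   # illustrative sizes
mha = kv_cache_per_token(n_heads, d_head)   # one KV head per query head
gqa = kv_cache_per_token(8, d_head)         # 8 groups of query heads
mqa = kv_cache_per_token(1, d_head)         # all query heads share one KV head
```

With these numbers MHA caches 8192 floats per token per layer, GQA 2048, and MQA 256; MLA instead caches a latent whose size is independent of the head count.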
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used for alignment that estimates advantages from the relative rewards of a group of sampled responses, removing the need for a separate critic model
SFT: Supervised Fine-Tuning—training on labeled instruction-following data
Decoupled RoPE: A strategy in MLA where positional embeddings are applied to a separate vector to avoid interfering with low-rank compression