MLA: Multi-head Latent Attention—an attention mechanism that jointly compresses keys and values into a shared low-rank latent vector, shrinking the KV cache while maintaining performance.
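A minimal NumPy sketch of the low-rank idea behind MLA: only the small latent is cached, and full keys/values are reconstructed from it by up-projections. All dimension sizes and weight names here are illustrative, not DeepSeek's actual configuration.

```python
import numpy as np

# Illustrative sizes only, not DeepSeek's actual dimensions.
d_model, d_latent, n_heads, d_head = 512, 64, 8, 64

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02           # joint KV down-projection
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # key up-projection
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # value up-projection

h = rng.standard_normal((10, d_model))  # hidden states for 10 cached tokens
c_kv = h @ W_down                       # only this latent (10 x 64) is cached
k = c_kv @ W_up_k                       # full keys/values reconstructed on demand
v = c_kv @ W_up_v
```

The memory saving comes from caching `c_kv` (64 numbers per token here) instead of the full keys and values (2 × 8 × 64 numbers per token).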
DeepSeekMoE: A specific MoE architecture using fine-grained expert segmentation (many small experts) and shared expert isolation (some experts always active) to improve specialization.
KV Cache: Key-Value Cache—storing the calculated Key and Value vectors for previous tokens during text generation to avoid re-computation.
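A toy sketch of KV caching during autoregressive decoding: at each step only the newest token's key/value are computed and appended, and attention runs over the accumulated cache. Function names and dimensions are illustrative.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for one query over all cached positions.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 16
K_cache, V_cache = [], []
for step in range(5):
    q, k, v = rng.standard_normal((3, d))  # projections for the newest token only
    K_cache.append(k)                      # append to cache instead of recomputing history
    V_cache.append(v)
    out = attend(q, np.array(K_cache), np.array(V_cache))
```

Without the cache, every step would recompute keys and values for all previous tokens, making generation quadratic in work rather than in memory alone.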
MoE: Mixture-of-Experts—a model architecture where only a subset of parameters (experts) are activated for each token, saving compute.
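A minimal sketch of top-k MoE routing: a gate scores every expert, but only the k best are actually evaluated for the token. The gating scheme and expert functions here are toy assumptions for illustration.

```python
import numpy as np

def moe_layer(x, experts, gate_W, k=2):
    """Send token x through only its top-k experts, weighted by gate scores."""
    logits = gate_W @ x
    top = np.argsort(logits)[-k:]                  # indices of the k best experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                           # softmax over the selected experts only
    return sum(p * experts[i](x) for p, i in zip(probs, top))

rng = np.random.default_rng(0)
experts = [lambda x, s=s: (s + 1) * x for s in range(4)]  # toy experts: scale the input
gate_W = rng.standard_normal((4, 8))
y = moe_layer(rng.standard_normal(8), experts, gate_W)
```

The compute saving is that only `k` of the `len(experts)` expert networks run per token, while total parameter count can grow with the number of experts.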
RoPE: Rotary Position Embedding—a method for encoding token positions by rotating their vector representations.
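A compact sketch of the rotation at the heart of RoPE: pairs of dimensions are rotated by angles proportional to the token's position, so query-key dot products depend only on relative distance. The pairing convention below is one common layout; exact details vary by implementation.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate pairs of dimensions of x by angles that grow with position."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per 2-D plane
    a = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(a) - x2 * np.sin(a),
                           x1 * np.sin(a) + x2 * np.cos(a)])

rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, 8))
# Shifting both positions by the same offset leaves the attention score unchanged.
score_a = rope(q, 3) @ rope(k, 1)
score_b = rope(q, 103) @ rope(k, 101)
```

This relative-position property is what the rotation buys: the score between positions 3 and 1 equals the score between 103 and 101.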
Decoupled RoPE: A strategy in MLA where positional information is applied to a separate, shared vector rather than the compressed latent vectors, preserving the ability to absorb projection matrices.
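A rough sketch of the decoupling: the cached key is the concatenation of a position-free part reconstructed from the latent and a small rotated part that carries position. All dimensions and variable names are illustrative assumptions, not DeepSeek's actual layout.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    a = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(a) - x2 * np.sin(a),
                           x1 * np.sin(a) + x2 * np.cos(a)])

# Illustrative sizes: latent content part + small rotary part.
d_latent, d_rope, d_head = 64, 16, 48
rng = np.random.default_rng(0)
c = rng.standard_normal(d_latent)        # compressed latent (position-free, cachable)
k_rope = rng.standard_normal(d_rope)     # separate key component that receives RoPE
W_up_k = rng.standard_normal((d_head, d_latent)) * 0.02

# Positional rotation touches only the decoupled part, never the latent,
# so W_up_k can still be absorbed into other projections at inference.
key = np.concatenate([W_up_k @ c, rope(k_rope, pos=5)])
```

If RoPE were applied after the up-projection instead, the position-dependent rotation would sit between `W_up_k` and the query projection, blocking the matrix-absorption trick.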
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used for alignment that optimizes based on group-relative rewards.
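The "group-relative" part can be sketched in a few lines: sample several responses per prompt, then standardize each response's reward against its group's mean and standard deviation to get an advantage, with no learned value network. This shows only the advantage computation, not the full policy-gradient update.

```python
import numpy as np

def grpo_advantages(rewards):
    """Advantage of each sampled response relative to its group:
    just the group's mean and std, no value network."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Rewards for 4 sampled responses to the same prompt (illustrative values).
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses better than the group average get positive advantage and are reinforced; worse-than-average responses get negative advantage.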
SFT: Supervised Fine-Tuning—training the model on high-quality instruction-response pairs.
MHA: Multi-Head Attention—the standard attention mechanism in Transformers with separate heads for Query, Key, and Value.
GQA: Grouped-Query Attention—an attention variant where multiple query heads share a single key/value head to save memory.
MQA: Multi-Query Attention—the limiting case of GQA in which all query heads share a single key/value head.
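The memory trade-off across MHA, GQA, and MQA comes down to how many key/value heads must be cached. A back-of-the-envelope sketch, with head counts and sizes chosen purely for illustration:

```python
def kv_cache_bytes(n_kv_heads, d_head=128, seq_len=4096, dtype_bytes=2):
    """Per-layer KV-cache size: two tensors (K and V), each seq_len x n_kv_heads x d_head."""
    return 2 * seq_len * n_kv_heads * d_head * dtype_bytes

# With 32 query heads: MHA keeps 32 KV heads, a GQA setup might share 8, MQA keeps 1.
for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    print(f"{name}: {kv_cache_bytes(n_kv) / 2**20:.0f} MiB per layer")
```

Query-head count is unchanged in all three; only the number of cached key/value heads shrinks, which is why GQA and MQA cut cache size without reducing the model's query capacity.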