DSA: DeepSeek Sparse Attention, an attention mechanism in which a lightweight indexer selects the top-k most relevant tokens for each query, so attention is computed over those k tokens rather than over the full context
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt, eliminating the need for a separate value function
Lightning Indexer: A lightweight neural component in DSA that computes coarse relevance scores to filter tokens before full attention
MLA: Multi-Head Latent Attention, an attention variant introduced in DeepSeek-V2 that jointly compresses keys and values into a low-rank latent vector, shrinking the KV cache that must be held in memory during inference
Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer
Off-Policy: In RL, the setting where the training data was generated by a policy other than the one currently being updated, typically an older version of it
KL Divergence: A statistical measure of how one probability distribution differs from another; used here to prevent the model from drifting too far from its original behavior
Cold-start: The initial phase of training (often Supervised Fine-Tuning) to bootstrap the model's capabilities before Reinforcement Learning
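
To make the GRPO entry above concrete, here is a minimal sketch of the group-relative normalization it describes. The function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are assumptions for illustration, not details taken from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same prompt.

    Each output's advantage is its reward minus the group mean, divided by
    the group standard deviation. The group statistics act as the baseline,
    so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps (hypothetical stabilizer) avoids division by zero when all rewards match
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one prompt, scored 1.0 (correct) or 0.0 (incorrect)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive positive advantages and those below receive negative ones. If every output in a group gets the same reward, all advantages are zero and the group contributes no learning signal.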