DeepSeek Sparse Attention (DSA): A sparse attention mechanism that uses a lightweight 'lightning indexer' to select top-k relevant tokens for core attention
Lightning Indexer: A lightweight module in DSA that scores all preceding tokens to decide which ones core attention should attend to (much cheaper per token pair than full attention, but its cost still grows quadratically with sequence length)
Top-k: A selection strategy that keeps only the k elements with the highest scores
Cross-layer stability: The empirical observation that consecutive transformer layers often attend to the same or highly similar sets of tokens
Calibration set: A small set of data used to measure how sensitive the model is to a change (such as removing indexers) without full retraining
Distillation: Training a model (student) to match the output distribution of another model or objective (teacher)
MLA: Multi-head Latent Attention, the core attention mechanism used within the DSA framework
Prefill: The initial phase of LLM inference, in which the full prompt is processed to generate the first token (often compute-bound, especially for long contexts)
Greedy search: An algorithmic approach that makes the locally optimal choice at each step (here, deciding which layer to convert to 'Shared' to minimize loss)
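
The Top-k and Cross-layer stability entries above can be made concrete with a minimal sketch. It is illustrative only, not DSA's actual implementation: the function names `select_top_k` and `overlap`, the toy score matrix, and the choice to let a token select itself as well as earlier tokens are all assumptions. Real indexer scores would come from a learned module; here they are hard-coded.

```python
import heapq

def select_top_k(scores, k):
    """For each query position t, return the sorted indices of the k
    highest-scoring tokens among positions 0..t.

    scores[t][s] is a hypothetical indexer score of token s for query t.
    Assumption: the query token itself is a valid candidate.
    """
    selected = []
    for t, row in enumerate(scores):
        candidates = range(t + 1)  # causal: never look at future tokens
        top = heapq.nlargest(k, candidates, key=lambda s: row[s])
        selected.append(sorted(top))
    return selected

def overlap(a, b):
    """Jaccard overlap between two selected token sets (1.0 = identical).
    One possible way to quantify cross-layer stability."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy 4-token example with made-up scores.
scores = [
    [0.9, 0.0, 0.0, 0.0],
    [0.2, 0.7, 0.0, 0.0],
    [0.5, 0.1, 0.4, 0.0],
    [0.3, 0.8, 0.1, 0.6],
]
print(select_top_k(scores, k=2))  # [[0], [0, 1], [0, 2], [1, 3]]
```

Early positions have fewer than k candidates, so they keep every available token. Comparing the selections of two consecutive layers with `overlap` gives a number between 0 and 1; values close to 1 are what the cross-layer stability observation describes.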