KV Cache: Key-Value Cache, which stores the attention keys and values already computed for past tokens so they are not recomputed during text generation
Sink tokens: The first few tokens of a sequence (e.g., the start token), which collect a disproportionate share of attention mass and are crucial for stabilizing the model
Within-sentence support stability: The phenomenon where an LLM's attention focuses on the same set of past tokens throughout the generation of a single sentence or semantic span
Slow Step: A decoding step where the model performs dense, full-context attention to identify which past memories are currently relevant
Fast Step: A decoding step where the model attends only to a small, pre-selected subset of memory (the sparse cache), drastically reducing computation
Selector: A proposed module that ranks and selects which tokens to keep in the sparse cache using a mix of current attention evidence and statistical priors
Soft-NMS: Soft Non-Maximum Suppression, a technique that reduces redundancy by lowering the scores of tokens that lie very close to a higher-scoring token
CoT: Chain-of-Thought, a prompting technique in which models generate intermediate reasoning steps before the final answer
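
To make the Soft-NMS entry above concrete, here is a minimal sketch of how score decay could spread a selection across the context. It is an illustration under stated assumptions, not the selector described in the source: the function name `soft_nms_select`, the use of token-position distance as the notion of "close", the Gaussian decay, and the `sigma` parameter are all hypothetical choices made for this example.

```python
import math

def soft_nms_select(scores, k, sigma=2.0):
    """Pick up to k token positions, softly suppressing neighbours of each pick.

    Assumptions (not from the source): closeness is measured in token
    positions, and the decay is Gaussian with width controlled by sigma.
    """
    remaining = dict(enumerate(scores))  # position -> current score
    selected = []
    while remaining and len(selected) < k:
        # Take the highest-scoring position that is still available.
        best = max(remaining, key=remaining.get)
        selected.append(best)
        del remaining[best]
        # Lower the scores of nearby positions; far positions are nearly untouched.
        for pos in remaining:
            dist = abs(pos - best)
            remaining[pos] *= 1.0 - math.exp(-(dist ** 2) / sigma)
    return selected

# Plain top-2 would keep the adjacent positions 0 and 1; with decay the
# second pick moves to a distant high-scoring position instead.
print(soft_nms_select([0.9, 0.85, 0.1, 0.2, 0.8], k=2))  # [0, 4]
```

Unlike hard suppression, the neighbouring tokens are only down-weighted, so a very strong neighbour can still be selected when k is large enough.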