KV cache: A memory optimization that stores the already-computed Key and Value vectors of past tokens so they don't need to be recomputed at every decoding step
Reasoning models: LLMs trained to generate long 'Chain-of-Thought' sequences to solve complex problems (e.g., OpenAI o1, DeepSeek-R1)
FlashAttention: An algorithm that speeds up attention by tiling computations to minimize memory access (I/O) between slow HBM and fast SRAM
Triton: A programming language and compiler for writing highly efficient custom GPU kernels
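Prefill phase: The initial phase, in which the model processes the user's entire prompt in one parallel pass and fills the KV cache
Decoding phase: The sequential generation of new tokens, one by one, with each step reusing the cached Keys and Values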
HBM: High Bandwidth Memory, the main off-chip memory on a GPU; larger but slower than the on-chip SRAM
SRAM: Static Random Access Memory, a small, ultra-fast on-chip memory used for intermediate computations
Lipschitz continuity: A property of functions (like softmax) that limits how fast they can change, used here to bound the error of the importance approximation
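
As an illustration of how the KV cache, prefill phase, and decoding phase fit together, here is a minimal single-head sketch in plain Python. All names (`KVCache`, `project`, `prefill`, `decode_step`, `recompute_from_scratch`) are hypothetical, and `project` is a toy stand-in for learned query/key/value projections; a real prefill would also process the prompt in parallel rather than in a loop.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product attention of one query over all given keys/values.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def project(token_vec):
    # Assumption: toy stand-in for the learned W_q, W_k, W_v projections.
    q = list(token_vec)
    k = [2.0 * x for x in token_vec]
    v = [x + 1.0 for x in token_vec]
    return q, k, v

class KVCache:
    # Stores Key and Value vectors of every token seen so far.
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def prefill(prompt_vecs, cache):
    # Prefill: push every prompt token's K/V into the cache.
    # (Looped here for clarity; real systems do this in one parallel pass.)
    out = None
    for vec in prompt_vecs:
        q, k, v = project(vec)
        cache.append(k, v)
        out = attend(q, cache.keys, cache.values)
    return out

def decode_step(token_vec, cache):
    # Decoding: compute K/V only for the new token, reuse the rest from cache.
    q, k, v = project(token_vec)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

def recompute_from_scratch(all_vecs):
    # Baseline without a cache: re-project every past token at this step.
    projected = [project(vec) for vec in all_vecs]
    keys = [k for _, k, _ in projected]
    values = [v for _, _, v in projected]
    last_query = projected[-1][0]
    return attend(last_query, keys, values)

prompt = [[1.0, 0.0], [0.0, 1.0]]
new_token = [0.5, 0.5]

cache = KVCache()
prefill(prompt, cache)
cached_out = decode_step(new_token, cache)
baseline_out = recompute_from_scratch(prompt + [new_token])
```

In this sketch, `cached_out` and `baseline_out` should agree, since the cache only avoids repeating work; the price is memory that grows by one Key and one Value vector per token, which is why long reasoning traces make the cache expensive.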