KV cache: Stored Key and Value vectors from previous tokens in an LLM generation sequence, reused to avoid recomputation
PagedAttention: A memory management technique that splits the KV cache into fixed-size blocks that need not be contiguous in memory, reducing fragmentation at the cost of requiring specialized attention kernels
CUDA VMM: CUDA Virtual Memory Management, a set of low-level driver APIs (e.g., cuMemAddressReserve, cuMemCreate, cuMemMap) giving explicit control over virtual address reservation and physical memory mapping on NVIDIA GPUs
TLB: Translation Lookaside Buffer, a hardware cache of recent virtual-to-physical address translations that avoids a page-table walk on each memory access
FlashAttention: A highly optimized, IO-aware exact attention algorithm that typically expects contiguous memory inputs
FlashInfer: A kernel library for LLM serving offering high-performance attention implementations
Internal fragmentation: Wasted memory space within allocated blocks (e.g., reserving max context length when only a fraction is used)
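
The KV cache, PagedAttention, and internal fragmentation entries above can be tied together with a small worked calculation. The following is a minimal sketch, not taken from any particular serving system: the function names, the model shape (32 layers, 32 KV heads, head dimension 128, 2-byte elements), and the request sizes are all hypothetical values chosen for illustration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: one Key vector and one Value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def wasted_tokens_contiguous(seq_len, max_len):
    # Slots reserved up front for the maximum context length but never used.
    return max_len - seq_len


def wasted_tokens_paged(seq_len, block_size):
    # Only the last, partially filled block contains unused slots.
    num_blocks = -(-seq_len // block_size)  # ceiling division
    return num_blocks * block_size - seq_len


# Hypothetical model shape and request, chosen only for illustration.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
seq_len, max_len, block_size = 1000, 4096, 16

contiguous_waste = wasted_tokens_contiguous(seq_len, max_len) * per_token
paged_waste = wasted_tokens_paged(seq_len, block_size) * per_token

print(f"KV bytes per token:        {per_token}")
print(f"Waste, max-len reservation: {contiguous_waste} bytes")
print(f"Waste, paged blocks:        {paged_waste} bytes")
```

Under these assumed numbers, a 1000-token request that reserves 4096 slots leaves 3096 slots unused, while a paged layout with 16-token blocks leaves only the 8 unused slots in the final block.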