KV Cache: A memory store holding Key and Value matrices for past tokens in a Transformer to speed up generation.
GQA: Grouped Query Attention, a method where query heads are split into groups and each group shares a single key/value head, shrinking the KV cache relative to standard multi-head attention.
Retrofitting: Adapting a pre-trained model to a new capability (here, compression) via short continued training on a small subset of data.
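Gumbel-sigmoid: A continuous relaxation of sampling a discrete binary (Bernoulli) variable, obtained by adding noise to a logit and passing it through a temperature-scaled sigmoid; it lets gradients flow through binary decisions (like 'append' vs 'accumulate').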
HBM: High Bandwidth Memory, the main GPU device memory where model weights and KV caches are stored; its bandwidth, rather than compute throughput, often limits LLM inference speed.
GEMM: General Matrix Multiply, the dense matrix-matrix multiplication routine that accounts for most of the computation in neural networks.
Memory-bound: A computing scenario where execution speed is limited by how fast data can be moved from memory, not how fast the processor can calculate.
MHSA: Multi-Head Self-Attention—the core mechanism in Transformers allowing tokens to attend to other tokens.
H2O/TOVA: Baseline cache eviction policies that drop less important tokens from the KV cache based on attention scores.
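
As an illustration of the Gumbel-sigmoid entry above, the following is a minimal sketch of how one relaxed binary sample can be drawn. It is not taken from the source: the function name gumbel_sigmoid, the temperature parameter, and the clamping constant are assumptions made for the example. It relies on the standard fact that the difference of two independent Gumbel(0, 1) samples follows a Logistic(0, 1) distribution.

```python
import math
import random


def gumbel_sigmoid(logit, temperature=1.0, rng=random):
    """Draw one relaxed binary sample in (0, 1) (hypothetical helper).

    Thresholding the result at 0.5 gives a hard decision that equals 1
    with probability sigmoid(logit); lower temperatures push the soft
    sample closer to 0 or 1.
    """
    # Logistic(0, 1) noise, i.e. the difference of two Gumbel(0, 1) samples.
    # u is clamped away from 0 and 1 so both logarithms are finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)

    z = (logit + noise) / temperature

    # Numerically stable sigmoid: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    random.seed(0)
    # Soft decisions for a logit that favours the value 1.
    print([round(gumbel_sigmoid(2.0, temperature=0.5), 3) for _ in range(5)])
```

In a training setup, the soft value would typically stand in for the hard 0/1 decision so that gradients can reach the logit; at inference time the decision can be made by thresholding without noise. Both of these usage choices are assumptions of this sketch rather than statements about any particular method.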