Speculative Decoding: An inference acceleration technique in which a small 'draft' model proposes several tokens that a larger 'target' model then verifies in parallel, in a single forward pass.
Drafter: A smaller, faster model used to generate tentative token sequences.
Logits: The raw, unnormalized scores output by the final layer of a neural network before the softmax function.
GEMM: General Matrix Multiply, the dense matrix-matrix product that dominates compute in deep learning; here it refers to the matrix multiplication in the output projection layer (the LM head).
LM Head: The final linear layer of a language model that projects hidden states to vocabulary-sized logits.
Index Selection: The process of gathering specific rows/columns from a matrix based on indices.
EAGLE: A speculative decoding framework that uses a lightweight transformer layer as the drafter.
Spherical k-means: A clustering algorithm that groups data points based on cosine similarity (direction) rather than Euclidean distance.
CUDA stream: A sequence of operations that execute in order on the GPU; different streams can run concurrently.
Recall: In this context, the proportion of 'correct' (target-accepted) tokens that are included in the drafter's shortlist.
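
As a rough illustration of how several of these terms fit together (index selection, the LM head GEMM, logits, and recall), here is a minimal pure-Python sketch. The function names, the toy weights, and the assumption that there is one shortlist per decoding step are hypothetical and chosen only for illustration; they are not taken from EAGLE or any other framework.

```python
# Illustrative sketch only: all names, sizes, and values are hypothetical.

def shortlist_logits(hidden, lm_head, shortlist):
    """Compute logits only for the shortlisted vocabulary ids.

    hidden:    list[float] of length d (one hidden state)
    lm_head:   list of V rows, each a list[float] of length d
    shortlist: list[int] of vocabulary ids to keep
    """
    # Index selection: gather the LM-head rows for the shortlisted ids.
    rows = [lm_head[i] for i in shortlist]
    # Reduced GEMM (a matrix-vector product here): one dot product per row.
    return [sum(w * h for w, h in zip(row, hidden)) for row in rows]


def shortlist_recall(accepted_tokens, shortlists):
    """Fraction of target-accepted tokens that appear in their step's shortlist.

    Assumes one shortlist per decoding step, aligned with accepted_tokens.
    """
    if not accepted_tokens:
        return 0.0
    hits = sum(tok in sl for tok, sl in zip(accepted_tokens, shortlists))
    return hits / len(accepted_tokens)


# Toy example: vocabulary of 4 tokens, hidden size 2.
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
hidden = [2.0, 3.0]

# Logits for token ids 0 and 2 only; the other rows are never touched.
logits = shortlist_logits(hidden, lm_head, shortlist=[0, 2])

# Three decoding steps: the accepted token is in the shortlist for steps 1 and 3.
recall = shortlist_recall([2, 3, 1], [[0, 2], [0, 1], [1, 3]])
```

In this toy setup only two of the four LM-head rows are multiplied, which is the source of the savings; the recall value measures how often that shortcut still contains the token the target model accepts.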