MSA: Multi-Stream Attention—a modification to standard attention that allows a model to process a main stream (current token) and multiple speculative streams (future tokens) simultaneously.
speculative decoding: An inference technique where a cheaper method guesses future tokens, and the main model verifies them in parallel to speed up generation.
LoRA: Low-Rank Adaptation—a technique to fine-tune large models by training only small, low-rank matrices instead of all weights.
Medusa: A prior single-model speculative decoding method that uses multiple heads to predict future tokens independently.
n-gram: A contiguous sequence of n items (tokens) from a given sample of text.
tree drafting: Organizing speculated tokens into a branching tree structure rather than a single sequence, allowing the verification step to check multiple possible future paths at once.
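call reduction ratio: The number of model forward passes standard autoregressive decoding would need, divided by the number the speculative method actually makes to produce the same output. Higher values mean more of the expensive forward passes are avoided.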
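FLOPs: Floating-point operations, a count of the arithmetic work a computation requires and so a measure of computational cost. Not to be confused with FLOPS (floating-point operations per second), which measures hardware throughput.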
KV cache: Key-Value cache, which stores the attention keys and values already computed for previous tokens so they are not recomputed at each generation step.
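
To make the speculative decoding and call reduction ratio entries concrete, here is a minimal Python sketch of greedy draft-and-verify decoding. It is an illustration under stated assumptions, not the method of any particular paper: `target_next` and `draft_next` are hypothetical stand-ins for the expensive model and the cheap guesser, each mapping a token list to the next token, and the verification loop stands in for what would be a single batched forward pass in a real model, so it is counted as one call.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[List[int]], int]  # hypothetical interface: context -> next token


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: Sequence[int],
    num_new_tokens: int,
    k: int,
) -> Tuple[List[int], int]:
    """Greedy draft-and-verify decoding; returns (tokens, target_calls)."""
    tokens = list(prompt)
    target_calls = 0
    while len(tokens) - len(prompt) < num_new_tokens:
        # 1. The cheap drafter guesses k future tokens one after another.
        draft: List[int] = []
        ctx = list(tokens)
        for _ in range(k):
            guess = draft_next(ctx)
            draft.append(guess)
            ctx.append(guess)

        # 2. Verification. A real model would score all k + 1 positions in
        #    one parallel forward pass, so this whole loop counts as one call.
        target_calls += 1
        ctx = list(tokens)
        for guess in draft:
            expected = target_next(ctx)
            if expected != guess:
                ctx.append(expected)  # first mismatch: keep the target's token
                break
            ctx.append(guess)  # guess accepted
        else:
            ctx.append(target_next(ctx))  # all accepted: one bonus token
        tokens = ctx

    # The last step may overshoot; trim to the requested length.
    return tokens[: len(prompt) + num_new_tokens], target_calls


def call_reduction_ratio(num_new_tokens: int, target_calls: int) -> float:
    """Autoregressive decoding needs one target call per generated token."""
    return num_new_tokens / target_calls


def count_up(ctx: List[int]) -> int:
    """Toy 'target model': the next token is the previous one plus 1, mod 10."""
    return (ctx[-1] + 1) % 10


good_tokens, good_calls = speculative_decode(count_up, count_up, [0], 8, k=3)
bad_tokens, bad_calls = speculative_decode(count_up, lambda ctx: -1, [0], 8, k=3)
print(call_reduction_ratio(8, good_calls))  # perfect drafter
print(call_reduction_ratio(8, bad_calls))   # drafter that is always wrong
```

In this toy setup a drafter that always agrees with the target yields k + 1 tokens per verification call, while one that is always wrong yields a single token per call, so the ratio falls back to 1. Either way the output matches plain greedy decoding with the target, which is the property verification is meant to preserve.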