Speculative Decoding: An inference technique in which a small 'draft' model proposes tokens that are verified in parallel by a large 'target' model, speeding up generation without changing the target model's output distribution.
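The accept/reject rule at the heart of speculative decoding can be sketched as follows. This is a minimal illustration: the probability tables stand in for actual model forward passes, and the function name is hypothetical. Each drafted token is accepted with probability min(1, p/q); on rejection, a token is resampled from the normalized residual max(0, p - q), which preserves the target distribution.

```python
import random

def speculative_step(draft_probs, target_probs, drafted):
    """Accept or reject drafted tokens so output matches the target model.

    draft_probs[i][t] / target_probs[i][t]: probability of token t at
    position i under the draft / target model (stand-ins for model calls).
    drafted: token ids proposed by the draft model.
    """
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if random.random() < min(1.0, p / q):  # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            # Resample from the residual distribution max(0, p - q), normalized.
            residual = {t: max(0.0, target_probs[i][t] - draft_probs[i][t])
                        for t in target_probs[i]}
            z = sum(residual.values()) or 1.0
            r, acc = random.random() * z, 0.0
            for t, w in residual.items():
                acc += w
                if r <= acc:
                    accepted.append(t)
                    break
            break  # stop at the first rejection
    return accepted
```

When the draft and target distributions coincide, p/q = 1 at every position and all drafted tokens are accepted, which is why a well-aligned draft model yields large speedups.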
Draft Model: A smaller, faster model (or head) used to propose candidate tokens.
Target Model: The main, large LLM whose output distribution must be matched.
Contemplate Token: A special token (aka pause token) added to the sequence to allow the model to perform extra computation or express internal reasoning states without generating visible text.
Soft Prompts: Learnable vectors prepended to the input that guide the model's behavior (here, to produce future predictions) without changing model weights.
MoE (Mixture-of-Experts): A neural architecture where different sub-networks (experts) are activated for different inputs based on a gating mechanism.
Tree Attention: An attention mechanism allowing verification of multiple branching draft token sequences in a single forward pass.
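A tree attention mask can be sketched as below. This is an illustrative construction, not tied to any specific library: each drafted token attends only to itself and its ancestors in the token tree, so sibling branches are verified independently within one forward pass.

```python
def tree_attention_mask(parents):
    """Build a boolean attention mask for a token tree.

    parents[i] is the parent index of node i (-1 for the root).
    mask[i][j] is True iff node i may attend to node j, i.e. j is
    node i itself or one of its ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i][j] = True
            j = parents[j]
    return mask
```

For example, with parents = [-1, 0, 0, 1] (two branches off the root), node 3 attends to {0, 1, 3} while node 2 attends only to {0, 2}, keeping the two candidate continuations isolated.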
Anchor Token Sampling: A training strategy where contemplate tokens are inserted only at random positions (anchors) rather than every position to save memory.
KL-divergence: An asymmetric statistical measure quantifying how one probability distribution differs from a second, reference probability distribution.