N-gram Embedding: A method to augment token representations by looking up embeddings for multi-token sequences (n-grams) using hashing, without a fixed vocabulary.
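A minimal sketch of the idea: instead of storing a vocabulary of n-grams, each n-gram is hashed directly into a fixed-size table. The table size, dimension, and summation over positions here are illustrative assumptions, not the original method's exact configuration.

```python
import numpy as np

TABLE_SIZE = 2**20   # illustrative hashed-table size (no n-gram vocabulary stored)
EMBED_DIM = 64

rng = np.random.default_rng(0)
table = rng.normal(size=(TABLE_SIZE, EMBED_DIM))

def ngram_embedding(token_ids, n):
    """Look up a hashed embedding for the n-gram ending at each position."""
    out = np.zeros((len(token_ids), EMBED_DIM))
    for i in range(n - 1, len(token_ids)):
        ngram = tuple(token_ids[i - n + 1 : i + 1])
        idx = hash(ngram) % TABLE_SIZE  # modulo maps any n-gram into the table
        out[i] += table[idx]
    return out

vecs = ngram_embedding([5, 9, 9, 2, 7], n=2)
```

The first position has no complete 2-gram, so its row stays zero; in practice these vectors would be added to (or concatenated with) the ordinary token embeddings.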
MoE: Mixture-of-Experts—a model architecture that activates only a subset of network parameters (experts) for each input, decoupling total capacity from compute cost.
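A toy illustration of the decoupling: only the top-k experts selected by a gate are evaluated, so compute scales with k while total parameters scale with the number of experts. The shapes and the softmax-over-selected-scores gating here are simplifying assumptions.

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, k=2):
    """Route x to the top-k experts by gate score; only those experts'
    parameters are touched, so compute depends on k, not num_experts."""
    scores = x @ gate_weights                       # (num_experts,)
    top_k = np.argsort(scores)[-k:]                 # selected expert indices
    probs = np.exp(scores[top_k]) / np.exp(scores[top_k]).sum()
    # Weighted combination of the selected experts' outputs only.
    return sum(p * (x @ expert_weights[e]) for p, e in zip(probs, top_k))

rng = np.random.default_rng(0)
num_experts, d = 8, 16
experts = rng.normal(size=(num_experts, d, d))
gate = rng.normal(size=(d, num_experts))
y = moe_layer(rng.normal(size=d), experts, gate, k=2)
```

With k=2 of 8 experts active, roughly a quarter of the expert parameters are used per input.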
Hash Collision: When different n-grams map to the same index in the embedding table due to the modulo operation, causing semantic ambiguity.
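Collisions are guaranteed by the pigeonhole principle once the number of distinct n-grams exceeds the table size. A tiny demonstration, with an illustrative 1000-row table:

```python
TABLE_SIZE = 1000  # far smaller than the space of possible n-grams

def bucket(ngram):
    return hash(ngram) % TABLE_SIZE

# Scan 2-grams until two distinct ones land in the same bucket.
seen = {}
collision = None
for a in range(200):
    for b in range(200):
        idx = bucket((a, b))
        if idx in seen and seen[idx] != (a, b):
            collision = ((a, b), seen[idx])  # two n-grams, one embedding row
            break
        seen[idx] = (a, b)
    if collision:
        break
```

Both colliding n-grams would share a single embedding vector, which is the source of the semantic ambiguity noted above.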
Sparsity Level: The ratio of total parameters to activated parameters during inference.
Pareto Frontier: The set of optimal trade-offs; here, the best possible loss achievable for a given computational cost or parameter budget.
Speculative Decoding: An inference technique where a smaller model drafts tokens that are verified by a larger model, speeding up generation.
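The draft-and-verify loop can be sketched with toy deterministic "models" standing in for the small and large networks (the +1 rules below are purely illustrative; real systems accept or reject probabilistically):

```python
def draft_model(prefix, k):
    """Cheap drafter: proposes the next k tokens (toy rule: previous + 1)."""
    out, cur = [], prefix[-1]
    for _ in range(k):
        cur = (cur + 1) % 10
        out.append(cur)
    return out

def target_model(prefix):
    """Expensive verifier (toy rule: previous + 1, but emits 0 after a 7)."""
    last = prefix[-1]
    return 0 if last == 7 else (last + 1) % 10

def speculative_step(prefix, k=4):
    """Accept drafted tokens while the target model agrees; on the first
    mismatch, take the target model's token instead and stop."""
    accepted = []
    for tok in draft_model(prefix, k):
        expected = target_model(prefix + accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)   # target's correction
            break
    return accepted

step = speculative_step([5], k=4)  # → [6, 7, 0]
```

When the drafter agrees often, several tokens are committed per expensive verification pass, which is where the speedup comes from.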
Embedding Amplification: Techniques (scaling factors or LayerNorm) applied to embedding outputs to ensure their signal strength is comparable to attention outputs in the residual stream.
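Both variants mentioned in the definition can be sketched in a few lines; the scaling factor, dimension, and bias-free LayerNorm here are assumptions for illustration:

```python
import numpy as np

def amplify(embed_out, mode="scale", alpha=4.0, eps=1e-6):
    """Boost embedding outputs before they enter the residual stream:
    either a fixed scaling factor or a LayerNorm-style renormalization."""
    if mode == "scale":
        return alpha * embed_out
    # LayerNorm without learned gain/bias: zero-mean, unit-variance per vector,
    # so the output magnitude no longer depends on how weak the input was.
    mu = embed_out.mean(axis=-1, keepdims=True)
    var = embed_out.var(axis=-1, keepdims=True)
    return (embed_out - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
weak = 0.01 * rng.normal(size=(2, 8))   # weak embedding signal
strong = amplify(weak, mode="ln")
```

The LayerNorm variant is scale-invariant, which makes the amplification robust to how the embedding table happens to be initialized.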
Polynomial Rolling Hash: A hash function that treats the tokens of an n-gram as coefficients of a polynomial evaluated at a fixed base modulo a large prime; because overlapping n-grams share most of their terms, the hash of the next n-gram can be computed from the previous one in constant time.
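A minimal implementation, assuming a base larger than the token alphabet and a large prime modulus (the specific constants are illustrative):

```python
BASE = 257            # assumed larger than any token id
MOD = 2**61 - 1       # large prime modulus

def poly_hash(tokens):
    """h = (t0*BASE^(n-1) + t1*BASE^(n-2) + ... + t_{n-1}) mod MOD"""
    h = 0
    for t in tokens:
        h = (h * BASE + t) % MOD
    return h

def roll(h, out_tok, in_tok, n):
    """Slide the n-gram window by one token: drop out_tok, append in_tok,
    in O(1) instead of rehashing all n tokens."""
    h = (h - out_tok * pow(BASE, n - 1, MOD)) % MOD
    return (h * BASE + in_tok) % MOD

h = poly_hash([3, 1, 4])
h_next = roll(h, out_tok=3, in_tok=1, n=3)   # hash of [1, 4, 1]
```

The rolled hash equals the hash computed from scratch, which is what makes scanning all n-grams of a sequence linear rather than quadratic in its length.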