MTP: Multi-Token Prediction—a training objective where the model predicts multiple future tokens at once, encouraging planning and enabling faster inference
Pass@k: A metric that considers a problem solved if at least one correct solution is found among k generated samples
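Pass@k is commonly computed with an unbiased estimator over n ≥ k samples rather than by literally drawing k samples. A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations, 3 correct: pass@1 = 3/10
print(pass_at_k(10, 3, 1))  # -> 0.3
```

Generating n > k samples and averaging this estimator gives a lower-variance measurement than sampling k directly.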
Speculative Decoding: An inference technique where a small model (or MTP head) drafts tokens quickly, which are then verified by the main model
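The draft-and-verify loop can be sketched with toy stand-ins for the two models (a greedy variant; real implementations verify token probabilities in one batched forward pass). All names here are hypothetical:

```python
def speculative_step(draft_next, target_next, prefix, gamma=4):
    """One speculative-decoding step (greedy toy version).
    draft_next/target_next: functions mapping a token list to the next token.
    The draft proposes gamma tokens; the target keeps the longest prefix it
    agrees with, plus one corrected (or bonus) token of its own."""
    # 1. Draft phase: the cheap model proposes gamma tokens.
    drafted, seq = [], list(prefix)
    for _ in range(gamma):
        t = draft_next(seq)
        drafted.append(t)
        seq.append(t)
    # 2. Verify phase: accept while the target model agrees.
    accepted, seq = [], list(prefix)
    for t in drafted:
        if target_next(seq) == t:
            accepted.append(t)
            seq.append(t)
        else:
            accepted.append(target_next(seq))  # target's correction
            break
    else:
        accepted.append(target_next(seq))  # bonus token: all drafts accepted
    return list(prefix) + accepted
```

Each step emits between 1 and gamma + 1 tokens, which is where the speedup comes from when the draft model agrees often.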
RL: Reinforcement Learning—training a model by rewarding desired behaviors (e.g., correct answers) rather than just imitating data
SFT: Supervised Fine-Tuning—training on labeled examples to teach the model instruction following before RL
MinHash: A technique for estimating the Jaccard similarity of sets (e.g., documents represented as token sets), used to detect and remove near-duplicates in training data
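A minimal MinHash sketch: each seeded hash function keeps its minimum over a token set, and the fraction of matching signature slots estimates Jaccard similarity. The md5-based hashing here is illustrative only; production deduplication pipelines use much faster permutation hashes:

```python
import hashlib

def minhash_signature(tokens, num_perm=64):
    """Signature: for each of num_perm seeded hash functions, keep the
    minimum hash value over the token set."""
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots ~= Jaccard similarity of the two sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("the cat sat on the mat".split())
b = set("the cat sat on a mat".split())
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
# True Jaccard: |a ∩ b| / |a ∪ b| = 5/6; the estimate should land nearby.
```

Comparing fixed-size signatures instead of full documents is what makes dataset-scale deduplication tractable.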
RoPE: Rotary Positional Embedding—a method for encoding position information in Transformer models, allowing better extrapolation to longer sequences
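RoPE rotates each consecutive pair of query/key dimensions by an angle proportional to the token's position, so attention scores depend only on relative position. A minimal NumPy sketch:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary embedding to a vector x (even dim) at position pos:
    rotate each pair (x[2i], x[2i+1]) by angle pos * base^(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Key property: the dot product depends only on the position offset.
s1 = rope(q, 5) @ rope(k, 3)      # offset 2
s2 = rope(q, 105) @ rope(k, 103)  # same offset 2 -> same score
```

That relative-position property is what lets RoPE-based models extrapolate (and be extended) to longer sequences than they were trained on.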
GQA: Grouped-Query Attention—an efficiency technique in Transformers where multiple query heads share key/value heads to save memory
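The memory saving comes from storing far fewer key/value heads in the KV cache than there are query heads. A minimal sketch (shapes and names are ours):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention sketch: each group of query heads shares
    one key/value head.
    Shapes: q (Hq, T, d); k, v (Hkv, T, d), with Hq a multiple of Hkv."""
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)  # expand shared KV heads to match q
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # only 2 KV heads are cached
v = rng.standard_normal((2, 4, 16))
out = gqa_attention(q, k, v)
```

Here the KV cache holds 2 heads instead of 8, a 4x reduction, while the output still has one result per query head.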
SwiGLU: A gated activation function (a SiLU/Swish-gated linear unit) used in the feed-forward layers of modern LLMs for better performance
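The gating is simple to state in code: one projection is passed through SiLU and multiplies a second projection elementwise, before a final down-projection. A minimal sketch with hypothetical weight names:

```python
import numpy as np

def swiglu_ffn(x, W, V, W2):
    """SwiGLU feed-forward sketch: SiLU(x @ W) gates (x @ V) elementwise,
    then W2 projects back to the model dimension."""
    silu = lambda z: z / (1.0 + np.exp(-z))  # SiLU (a.k.a. Swish)
    return (silu(x @ W) * (x @ V)) @ W2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 32
x = rng.standard_normal(d_model)
W = rng.standard_normal((d_model, d_ff))
V = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))
y = swiglu_ffn(x, W, V, W2)
```

Note the three weight matrices: SwiGLU layers typically shrink the hidden width relative to a plain two-matrix FFN to keep the parameter count comparable.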
vLLM: A high-throughput library for LLM inference and serving
NIAH: Needle-In-A-Haystack—a test of long-context capability where a specific fact is hidden in a large amount of unrelated text