DPO: Direct Preference Optimization—an alignment method that optimizes a policy directly on human preference pairs, without training an explicit reward model or running a separate reinforcement-learning loop
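The DPO objective can be written per preference pair as a logistic loss on an implicit reward margin. A minimal sketch, assuming the inputs are summed log-probabilities of the chosen/rejected responses under the policy and a frozen reference model (the variable names and `beta=0.1` are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probs
    under the policy (pi_*) and the frozen reference model (ref_*)."""
    # Implicit reward margin: beta * (chosen log-ratio minus rejected log-ratio)
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: small when the policy prefers
    # the chosen response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that already favors the chosen response incurs a lower loss:
low = dpo_loss(pi_chosen=-5.0, pi_rejected=-20.0, ref_chosen=-10.0, ref_rejected=-10.0)
high = dpo_loss(pi_chosen=-20.0, pi_rejected=-5.0, ref_chosen=-10.0, ref_rejected=-10.0)
assert low < high
```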
Non-embedding FLOPs/token: A metric for model scale that counts floating point operations per token excluding the embedding layer, offering a more accurate proxy for compute than parameter count
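Why this matters can be seen with rough arithmetic: a sketch below counts parameters for a small decoder-only transformer (GPT-2-small-like shapes; the helper and the simplification of ignoring biases and layer norms are assumptions), then applies the standard ~6 FLOPs per non-embedding parameter per training token approximation:

```python
def non_embedding_params(n_layers, d_model, vocab_size, d_ff=None):
    """Rough parameter count for a decoder-only transformer, split into
    non-embedding and embedding parts (biases and layer norms ignored)."""
    d_ff = d_ff or 4 * d_model
    # Per layer: attention projections (4 * d_model^2) + FFN (2 * d_model * d_ff)
    per_layer = 4 * d_model * d_model + 2 * d_model * d_ff
    non_emb = n_layers * per_layer
    emb = vocab_size * d_model  # embedding table (tied unembedding assumed)
    return non_emb, emb

def train_flops_per_token(non_emb_params):
    # Common approximation: ~6 FLOPs per parameter per token
    # (2 for the forward pass, 4 for the backward pass).
    return 6 * non_emb_params

non_emb, emb = non_embedding_params(n_layers=12, d_model=768, vocab_size=50257)
# non_emb is ~85M while emb is ~39M: for small models the embedding table
# is a large slice of total parameters, which is why raw parameter count
# overstates the compute the model actually performs per token.
```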
IsoFLOP profile: A method to find optimal model/data allocations by fixing total compute (FLOPs) and varying model size and training data size to minimize loss
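One IsoFLOP slice can be sketched numerically: under the common C ≈ 6·N·D approximation, fixing the compute budget C determines the token count D for each candidate model size N (the budget and candidate sizes below are illustrative):

```python
def tokens_for_budget(compute_flops, n_params):
    """Training tokens affordable for a model of n_params parameters at a
    fixed compute budget, under the approximation C = 6 * N * D."""
    return compute_flops / (6 * n_params)

C = 1e21  # one fixed compute budget, i.e. one IsoFLOP slice (illustrative)
candidates = [4e8, 1e9, 4e9]  # model sizes to compare at this budget
allocations = [(n, tokens_for_budget(C, n)) for n in candidates]
# Training each (model size, token count) pair costs the same compute;
# the pair reaching the lowest final loss marks the optimum for this
# budget, and repeating over several budgets traces the scaling law.
```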
SwiGLU: A gated activation function (Swish-Gated Linear Unit) used in the feed-forward networks of modern LLMs
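The gating structure is compact enough to show directly. A minimal NumPy sketch, with illustrative weight shapes (in practice d_ff is often scaled to ~8/3 · d_model so the three matrices match a standard FFN's parameter count):

```python
import numpy as np

def swiglu_ffn(x, W, V, W2):
    """SwiGLU feed-forward block: FFN(x) = (Swish(x @ W) * (x @ V)) @ W2,
    where Swish(z) = z * sigmoid(z) (also called SiLU)."""
    gate = x @ W
    swish = gate * (1.0 / (1.0 + np.exp(-gate)))  # Swish / SiLU activation
    return (swish * (x @ V)) @ W2  # elementwise gating, then down-projection

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal((2, d_model))
W = rng.standard_normal((d_model, d_ff))
V = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))
y = swiglu_ffn(x, W, V, W2)
assert y.shape == (2, d_model)
```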
GQA: Grouped-Query Attention—an attention mechanism that shares key/value heads across multiple query heads to reduce memory bandwidth usage during inference
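A toy NumPy sketch of the sharing pattern (shapes and names are assumptions; a real implementation keeps only the n_kv_heads K/V caches in memory, which is where the bandwidth saving comes from):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy GQA: q has n_q_heads, while k and v have only n_kv_heads;
    each KV head serves a group of n_q_heads // n_kv_heads query heads."""
    n_q_heads, seq, d_head = q.shape
    group = n_q_heads // n_kv_heads
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 4, 16))   # 8 query heads
k = rng.standard_normal((2, 4, 16))   # only 2 KV heads
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
```

With n_kv_heads = 1 this reduces to multi-query attention; with n_kv_heads equal to the query head count it reduces to standard multi-head attention.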
Rotary Embedding: A positional encoding method (RoPE) that rotates pairs of dimensions in the query and key vectors by position-dependent angles, so that their dot products—and hence attention scores—depend on the relative positions of tokens
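A minimal NumPy sketch of the rotation, using the split-half pairing convention (the pairing layout and default base are common choices, not the only ones):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary position embedding: rotate consecutive dimension pairs of
    each vector by angles proportional to its position. Applied to both
    queries and keys, the dot product then depends only on the offset
    between positions."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # one frequency per pair
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to each (x1_i, x2_i) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(2)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))
# Same relative offset (2) at different absolute positions -> same score.
a = rope(q, np.array([3]))[0] @ rope(k, np.array([5]))[0]
b = rope(q, np.array([10]))[0] @ rope(k, np.array([12]))[0]
```

The rotation is norm-preserving, so it changes attention scores without distorting vector magnitudes.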
Multi-step learning rate scheduler: A schedule where the learning rate drops by a fixed factor at specific milestones (steps), rather than decaying continuously like a cosine schedule
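The schedule is a step function of the training step. A short sketch, with illustrative milestone steps and decay factor:

```python
def multistep_lr(step, base_lr=3e-4, milestones=(8000, 9000), gamma=0.316):
    """Multi-step LR schedule: hold base_lr, then multiply by gamma at
    each milestone step (values here are illustrative, e.g. gamma close
    to 1/sqrt(10))."""
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= gamma
    return lr

# Piecewise-constant: flat plateaus separated by discrete drops, unlike
# the smooth continuous decay of a cosine schedule.
schedule = [multistep_lr(s) for s in (0, 7999, 8000, 8999, 9000)]
```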
BPE: Byte-Pair Encoding—a tokenization algorithm that iteratively merges frequent pairs of bytes/characters
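The merge loop fits in a few lines. A minimal sketch of BPE training on a toy word-frequency table (the corpus is invented; real tokenizers operate on bytes and handle pre-tokenization, which this omits):

```python
from collections import Counter

def bpe_train(word_freqs, n_merges):
    """Toy BPE trainer: repeatedly find the most frequent adjacent symbol
    pair across the corpus and merge it into a single symbol."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}  # words as symbol tuples
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for syms, f in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += f
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = {}
        for syms, f in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = f
        vocab = new_vocab
    return merges

merges = bpe_train({"low": 5, "lower": 2, "lowest": 3}, n_merges=2)
# Frequent substrings like "low" quickly coalesce into single tokens.
```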