Mid-training: A second stage of pretraining using a distinct, high-quality data mixture (annealing) with a decaying learning rate
Checkpoint Soups: A technique of averaging the weights of multiple model checkpoints derived from different training runs (e.g., different data orders) to improve performance
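A checkpoint soup is just an elementwise average of parameters. A minimal sketch, assuming each checkpoint is stored as a dict mapping parameter names to arrays (the names and toy values below are illustrative, not from the source):

```python
import numpy as np

def checkpoint_soup(checkpoints):
    """Average parameter tensors elementwise across checkpoints.

    checkpoints: list of dicts, each mapping parameter name -> np.ndarray.
    All checkpoints must share the same names and shapes."""
    names = checkpoints[0].keys()
    return {n: np.mean([c[n] for c in checkpoints], axis=0) for n in names}

# Two toy "checkpoints" from runs with, e.g., different data orders
ckpt_a = {"w": np.array([[1.0, 2.0]])}
ckpt_b = {"w": np.array([[3.0, 4.0]])}
soup = checkpoint_soup([ckpt_a, ckpt_b])
# soup["w"] is the elementwise mean: [[2.0, 3.0]]
```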
Z-Loss: A regularization term (log² Z) added to the loss function to keep the softmax partition function Z from growing too large, improving training stability
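Concretely, the auxiliary term is the squared log-partition function scaled by a small coefficient. A minimal sketch for a single logit vector; the coefficient value 1e-4 is an illustrative choice, not taken from the source:

```python
import numpy as np

def z_loss(logits, coeff=1e-4):
    """Auxiliary z-loss: coeff * log(Z)^2, where Z = sum(exp(logits)).

    Penalizes the log-partition function drifting far from zero,
    which keeps softmax logits from blowing up during training."""
    log_z = np.log(np.sum(np.exp(logits)))  # log of the partition function Z
    return coeff * log_z ** 2
```

In practice this is averaged over tokens and added to the cross-entropy loss; a numerically stable implementation would use a log-sum-exp with max subtraction.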
RLVR: Reinforcement Learning with Verifiable Rewards—using ground-truth correctness (e.g., in math problems) as the reward signal rather than a learned reward model
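The defining feature of RLVR is that the reward comes from a programmatic check, not a learned model. A minimal sketch using exact-match verification (one simple choice of verifier; real setups may normalize or parse answers first):

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward from ground-truth correctness (e.g., a math answer).

    No learned reward model: the reward is 1.0 iff the model's final
    answer matches the verified solution after trimming whitespace."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0
```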
QK-Norm: Applying Layer Normalization to the Query and Key vectors within the attention mechanism to stabilize training
RMSNorm: Root Mean Square Layer Normalization—a simplified version of LayerNorm that divides inputs by their root mean square (with no mean subtraction), used here for better stability
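RMSNorm can be written in a few lines; QK-Norm then amounts to applying such a normalization to the query and key vectors before computing attention scores. A minimal sketch (the epsilon value is a common default, assumed here):

```python
import numpy as np

def rms_norm(x, gamma=1.0, eps=1e-6):
    """Divide x by its root mean square over the last axis, then apply
    a learned gain gamma. Unlike LayerNorm, no mean is subtracted."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

# QK-Norm (sketch): normalize queries and keys before the dot product,
# so attention logits stay bounded and training is more stable.
def qk_normed_logits(q, k, scale):
    return (rms_norm(q) @ rms_norm(k).T) * scale
```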
SwiGLU: A gated activation function (Swish-Gated Linear Unit) used in the feed-forward layers of the Transformer
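In a SwiGLU feed-forward layer, one linear projection is passed through Swish and used to gate a second linear projection elementwise. A minimal sketch with Swish at β = 1 (i.e., SiLU), omitting biases and the final output projection:

```python
import numpy as np

def swish(z):
    """Swish / SiLU activation with beta = 1: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu(x, W, V):
    """Gated feed-forward unit: Swish(xW) elementwise-gates xV.

    W and V are the two up-projection matrices of the FFN; a real
    Transformer block would follow this with a down-projection."""
    return swish(x @ W) * (x @ V)
```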
RoPE: Rotary Positional Embeddings—a method for encoding position information by rotating pairs of dimensions in the query and key vectors by position-dependent angles
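Each consecutive pair of dimensions (2i, 2i+1) is treated as a 2-D plane and rotated by pos · θᵢ, where the frequencies θᵢ follow a geometric schedule. A minimal sketch for one vector, assuming the common base of 10000:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary positional embedding to a single vector.

    x: (d,) query or key vector with even d; pos: integer position.
    Dimension pairs (2i, 2i+1) are rotated by angle pos * base^(-2i/d)."""
    d = x.shape[-1]
    freqs = base ** (-2.0 * np.arange(d // 2) / d)   # theta_i
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                        # the (2i, 2i+1) pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                  # 2-D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out
```

Because each step is a pure rotation, norms are preserved, and the dot product between a rotated query at position m and a rotated key at position n depends only on the relative offset m − n.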
GQA: Grouped Query Attention—an attention mechanism that shares key/value heads across multiple query heads to reduce memory usage during inference
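In GQA, the query heads are split into groups and every group shares one key/value head, shrinking the KV cache. A minimal single-example sketch (loop form for clarity; real implementations batch and mask):

```python
import numpy as np

def grouped_query_attention(Q, K, V, ):
    """Q: (n_q_heads, seq, d); K, V: (n_kv_heads, seq, d),
    where n_kv_heads divides n_q_heads. Each group of
    n_q_heads // n_kv_heads query heads shares one K/V head."""
    n_q_heads, seq, d = Q.shape
    n_kv_heads = K.shape[0]
    group = n_q_heads // n_kv_heads
    out = np.empty_like(Q)
    for h in range(n_q_heads):
        k, v = K[h // group], V[h // group]           # shared key/value head
        scores = Q[h] @ k.T / np.sqrt(d)              # scaled dot-product
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)            # softmax over keys
        out[h] = w @ v
    return out
```

With n_kv_heads = n_q_heads this reduces to standard multi-head attention; with n_kv_heads = 1 it is multi-query attention.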
DCLM: DataComp for Language Models—a dataset and benchmark suite for pretraining data
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects like math, history, and law
GSM8K: Grade School Math 8K—a dataset of high-quality, linguistically diverse grade school math word problems