SFT: Supervised Fine-Tuning—training a model on high-quality instruction-response pairs to teach it to follow instructions
DPO: Direct Preference Optimization—an alignment method that optimizes the model to prefer chosen responses over rejected ones without training a separate reward model
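The DPO loss for one preference pair can be sketched directly from the definition. The function name `dpo_loss` and the assumption that summed per-sequence log-probabilities are already available are mine, not from the source:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair, given the summed log-probs
    of the chosen/rejected responses under the policy (pi_*) and the
    frozen reference model (ref_*). beta scales the implicit reward."""
    # Implicit reward margin: how much more the policy prefers the
    # chosen response than the rejected one, relative to the reference.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin; minimized when the margin is large.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that no reward model appears anywhere: the reference model's log-probabilities play that role implicitly.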
RoPE: Rotary Positional Embeddings—a method that encodes position in Transformers by rotating query and key vectors, so attention scores depend on relative positions; it also extrapolates better to longer sequences than learned absolute embeddings
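A minimal numpy sketch of the rotation (pairing consecutive dimensions; real implementations vectorize this over heads and positions):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary embeddings to a vector x of even dimension d at
    sequence position pos. Each consecutive pair of dimensions is
    rotated by an angle depending on pos and the pair's frequency."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # per-pair rotation frequencies
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # (even, odd) components of each pair
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # standard 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The key property: the dot product of a rotated query at position m with a rotated key at position n depends only on m - n, which is what makes the encoding relative.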
GQA: Grouped Query Attention—an attention mechanism that shares key-value heads across multiple query heads to reduce memory bandwidth and improve inference speed
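The sharing can be shown at the shape level. This sketch (function name `gqa_scores` is mine) repeats each KV head so it lines up with its group of query heads; in a real kernel the KV tensors are simply read once per group rather than materialized:

```python
import numpy as np

def gqa_scores(q, k, n_kv_heads):
    """Attention scores with grouped queries. q: (n_q_heads, seq, d);
    k: (n_kv_heads, seq, d). Each group of n_q_heads // n_kv_heads
    query heads attends against the same key head."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    # Repeat each KV head to align it with its group of query heads.
    k_rep = np.repeat(k, group, axis=0)                 # (n_q_heads, seq, d)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)    # (n_q_heads, seq, seq)
```

With n_kv_heads = n_q_heads this reduces to standard multi-head attention; with n_kv_heads = 1 it is multi-query attention. The KV cache shrinks by the group factor, which is where the inference speedup comes from.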
SwiGLU: A specific activation function (Swish-Gated Linear Unit) used in the feed-forward networks of the Transformer
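A sketch of the SwiGLU feed-forward block (weight names `W_gate`, `W_up`, `W_down` are illustrative):

```python
import numpy as np

def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: the SiLU-activated gate projection is
    multiplied elementwise with a linear "up" projection, then
    projected back down."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```

Compared with a plain ReLU FFN, this uses three weight matrices instead of two, so the hidden dimension is typically shrunk to keep parameter count comparable.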
Tiktoken: A fast BPE tokenizer library used by OpenAI and adopted here with modifications
Rejection Sampling: A technique where the model generates multiple outputs, the best is selected (by a reward model or heuristic), and the model is fine-tuned on that selected output
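The selection step is best-of-n. A toy sketch, with `generate` and `score` as placeholders for the sampling model and the reward model or heuristic:

```python
import random

def rejection_sample(generate, score, prompt, n=8, seed=0):
    """Best-of-n selection: draw n candidate responses for the prompt
    and keep the one the scorer rates highest. The winning
    (prompt, response) pair is what gets added to the fine-tuning set."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)
```

In practice the expensive part is generating the n candidates; scoring and selection are cheap by comparison.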
Scaling Laws: Empirical relationships between model size, dataset size, and compute budget that predict model performance
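One widely used parametric form is the Chinchilla-style fit, where loss decomposes into an irreducible term plus power-law terms in parameters N and tokens D. The coefficients below are the values published by Hoffmann et al. (2022), shown for illustration only:

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style scaling law: predicted training loss as a
    function of parameter count N and training tokens D.
    E is the irreducible loss; the other terms shrink with scale."""
    return E + A / N**alpha + B / D**beta
```

Fits like this are what let labs choose model and dataset sizes before committing a compute budget.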
Annealing: A training phase at the very end where the learning rate is decayed to 0 and data quality is upsampled to boost final performance
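The learning-rate side of annealing can be sketched as a schedule that holds the base rate and then decays linearly to 0 over the final stretch. The decay shape and the 10% fraction here are illustrative assumptions, not a prescription:

```python
def annealing_lr(step, total_steps, base_lr, anneal_frac=0.1):
    """Toy schedule: hold base_lr for most of training, then in the
    final anneal_frac of steps decay the learning rate linearly to 0.
    (The fraction and decay shape vary by recipe.)"""
    anneal_start = int(total_steps * (1 - anneal_frac))
    if step < anneal_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - anneal_start)
```

The data-quality upsampling happens over this same final window, so the model's last gradient updates come from the cleanest data at the smallest step sizes.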
IsoFLOPs: Curves showing the trade-off between model size and training tokens for a fixed compute budget
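The trade-off follows from the common approximation C ≈ 6·N·D FLOPs for training a model with N parameters on D tokens; fixing C pins down D for each N, and an IsoFLOP curve sweeps N at fixed C and plots the resulting loss:

```python
def tokens_for_budget(C, N):
    """Given a fixed compute budget C (FLOPs) and model size N
    (parameters), return the token count D the budget affords,
    using the C = 6*N*D approximation."""
    return C / (6 * N)
```

For example, a 6e23 FLOP budget spent on a 10B-parameter model affords roughly 1e13 (10T) tokens; doubling the model halves the tokens.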
4D Parallelism: Combining Tensor, Pipeline, Context, and Data Parallelism to distribute training across thousands of GPUs
FSDP: Fully Sharded Data Parallelism—a technique that shards model parameters, gradients, and optimizer states across data parallel workers to save memory