DPO: Direct Preference Optimization—a method to align language models with human preferences without a separate reward model, used here for safety and cultural alignment
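The preference signal behind DPO can be sketched as a loss over log-probability ratios. This is a minimal illustration, not the report's implementation; the sequence log-probs are toy numbers, and `beta=0.1` is an assumed hyperparameter:

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return float(np.log1p(np.exp(-margin)))  # = -log(sigmoid(margin))

# Toy sequence log-probs: the policy already prefers the chosen response.
loss = dpo_loss(policy_chosen_logp=-4.0, policy_rejected_logp=-9.0,
                ref_chosen_logp=-5.0, ref_rejected_logp=-8.0)
print(round(loss, 4))  # 0.5981
```

Driving the loss down pushes the policy to raise the chosen response's likelihood relative to the rejected one, with no separately trained reward model in the loop.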
GQA: Grouped Query Attention—an attention mechanism in which a group of query heads shares a single key/value head, shrinking the KV cache and reducing memory bandwidth usage during inference
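The head-sharing idea can be shown in a small numpy sketch. This is an illustrative single-layer forward pass with made-up dimensions (8 query heads, 2 key/value heads), not the model's actual attention code:

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """Minimal GQA: each group of query heads attends with one shared K/V head."""
    T, _ = x.shape
    hd = Wq.shape[1] // n_q_heads          # per-head dimension
    group = n_q_heads // n_kv_heads        # query heads per K/V head
    q = (x @ Wq).reshape(T, n_q_heads, hd)
    k = (x @ Wk).reshape(T, n_kv_heads, hd)
    v = (x @ Wv).reshape(T, n_kv_heads, hd)
    # Broadcast each K/V head across its group of query heads.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    out = np.empty_like(q)
    for h in range(n_q_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(hd)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # softmax over keys
        out[:, h] = w @ v[:, h]
    return out.reshape(T, -1)

rng = np.random.default_rng(0)
T, d, n_q, n_kv, hd = 4, 16, 8, 2, 4
x = rng.normal(size=(T, d))
Wq = rng.normal(size=(d, n_q * hd))
Wk = rng.normal(size=(d, n_kv * hd))  # K/V projections are n_kv/n_q the size of Wq
Wv = rng.normal(size=(d, n_kv * hd))
y = grouped_query_attention(x, Wq, Wk, Wv, n_q, n_kv)
print(y.shape)  # (4, 32)
```

The KV cache stores only `n_kv_heads` heads per token instead of `n_q_heads`, which is where the memory-bandwidth saving comes from.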
ALiBi: Attention with Linear Biases—a positional encoding method that penalizes attention scores in proportion to query–key distance, allowing models to extrapolate to sequence lengths longer than those seen during training
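The linear bias is just a per-head penalty matrix added to attention scores before the softmax. A minimal sketch, assuming the geometric slope schedule from the ALiBi paper for a power-of-two head count:

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    """ALiBi: per-head bias -m * (i - j) added to causal attention scores."""
    # Geometric head slopes: 2^-1, 2^-2, ..., for n_heads a power of two.
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    dist = (i - j).clip(min=0)            # distance to past tokens only
    return -slopes[:, None, None] * dist  # shape (n_heads, seq_len, seq_len)

bias = alibi_bias(n_heads=8, seq_len=5)
print(bias.shape)     # (8, 5, 5)
print(bias[0, 4, 0])  # -2.0: head 0 penalizes the most distant token hardest
```

Because the penalty is a simple linear function of distance rather than a learned embedding, it applies unchanged to positions beyond the training length, which is what enables extrapolation.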
SFT: Supervised Fine-Tuning—training the pre-trained model on labeled instruction-response pairs to teach it how to follow instructions
CPT: Continual Pre-training—further training a base model on domain-specific or new language data to add capabilities without starting from scratch
Token-to-word ratio: A measure of tokenizer efficiency; a high ratio means a single word is broken into many tokens, increasing compute cost and shrinking the effective context window
Common Crawl: A massive open repository of web crawl data, often used as the primary source for training large language models