SwiGLU: A gated activation function combining Swish and GLU (Gated Linear Unit) that generally improves transformer performance compared to ReLU
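A minimal sketch of the idea, assuming NumPy and toy projection matrices `W` (gate) and `V` (value), which are illustrative names rather than anything from a particular codebase:

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish / SiLU: x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def swiglu(x, W, V):
    # Gated linear unit with a Swish gate: Swish(xW) elementwise-times xV.
    return swish(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # (batch, d_model)
W = rng.standard_normal((8, 16))  # gate projection (hypothetical sizes)
V = rng.standard_normal((8, 16))  # value projection
out = swiglu(x, W, V)
print(out.shape)  # (2, 16)
```

In a transformer FFN this replaces `ReLU(xW)`; the multiplicative gate lets the layer modulate each hidden unit smoothly instead of hard-thresholding it.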
Rotary Embeddings (RoPE): A positional encoding method that rotates query and key vectors in vector space by an angle proportional to each token's position, so attention scores depend on relative position and generalize better to variable sequence lengths
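A sketch of the rotation, assuming NumPy, the common base of 10000, and the pairing of consecutive dimensions; details vary across implementations:

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, d) with even d. Each consecutive pair of dimensions is
    # rotated by an angle that grows with position and shrinks with the
    # pair's index, so nearby positions get similar rotations.
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)        # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(1).standard_normal((6, 8))
q_rot = rope(q)
```

Because each step is a pure rotation, vector norms are preserved, and the dot product between a rotated query and key depends only on their relative offset.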
RMSNorm: Root Mean Square Layer Normalization—a simplified normalization technique that stabilizes training by normalizing inputs based on their root mean square, ignoring mean centering
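A minimal NumPy sketch; `weight` is the learnable per-dimension scale, and `eps` is a small constant for numerical stability:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize by the root mean square over the last axis; no mean
    # subtraction and no bias term, unlike LayerNorm.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[1.0, 2.0, 3.0]])
y = rms_norm(x, np.ones(3))
```

Dropping the mean-centering step saves a reduction per layer while keeping the key benefit, inputs rescaled to unit RMS.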
Chinchilla scaling laws: Empirical laws suggesting that for a fixed compute budget, parameter count and training tokens should be scaled in roughly equal proportion; LLaMA deliberately trains smaller models on far more tokens than this prescribes, trading extra training compute for cheaper inference
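As a back-of-the-envelope illustration, assuming the common rule-of-thumb reading of Chinchilla of roughly 20 training tokens per parameter (the exact ratio depends on the fitted constants):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # Rule-of-thumb reading of Chinchilla: ~20 tokens per parameter at
    # compute-optimality. tokens_per_param is an assumed heuristic value.
    return n_params * tokens_per_param

# A 7B-parameter model would be compute-optimal near 140B tokens; LLaMA-7B
# instead trained on about 1T tokens, "overtraining" by this rule to get a
# smaller model that is cheaper to serve.
print(chinchilla_optimal_tokens(7e9) / 1e9)  # 140.0
```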
BPE: Byte-Pair Encoding—a subword tokenization algorithm that starts from individual characters or bytes and iteratively merges the most frequent pair of adjacent symbols into a new vocabulary entry
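A toy training loop for the merge procedure, assuming a word-frequency corpus as input (real tokenizers add pre-tokenization, byte fallback, and faster data structures):

```python
from collections import Counter

def bpe_train(words, num_merges):
    # words: dict mapping a word (tuple of symbols) -> frequency.
    # Repeatedly merge the most frequent adjacent symbol pair.
    merges = []
    words = dict(words)
    for _ in range(num_merges):
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges, words

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w"): 3}
merges, vocab = bpe_train(corpus, 2)
```

Each merge adds one symbol to the vocabulary, so frequent substrings become single tokens while rare words still decompose into smaller pieces.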
FlashAttention: An IO-aware exact attention algorithm that reduces memory usage and speeds up training by minimizing reads/writes between GPU HBM and on-chip SRAM
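The core trick that makes tiling possible is the online softmax: attention for a query can be accumulated block by block with a running max and normalizer, so the full attention row never has to be materialized. A NumPy sketch of that numerics (FlashAttention itself is a fused GPU kernel; this only shows the math):

```python
import numpy as np

def blockwise_attention(q, K, V, block=4):
    # One query vector q against K/V processed in blocks, keeping a
    # running max m and normalizer s (the online-softmax recurrence).
    d = q.shape[-1]
    m = -np.inf           # running max of scores seen so far
    s = 0.0               # running softmax normalizer
    acc = np.zeros(d)     # running weighted sum of values
    for i in range(0, K.shape[0], block):
        scores = (K[i:i + block] @ q) / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)      # rescale old accumulators
        p = np.exp(scores - m_new)
        s = s * scale + p.sum()
        acc = acc * scale + p @ V[i:i + block]
        m = m_new
    return acc / s
```

The result is exact (not an approximation); the IO savings come from only ever touching one block of K/V at a time.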
Massive Multitask Language Understanding (MMLU): A benchmark covering 57 subjects (STEM, humanities, etc.) designed to test world knowledge and problem-solving
Chain-of-thought: A prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer
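A minimal illustration of the zero-shot variant, where a fixed trailing instruction nudges the model into emitting its reasoning before the answer (the question text here is made up for the example):

```python
# Hypothetical question; the appended cue is the standard zero-shot
# chain-of-thought trigger phrase.
question = ("A pack has 12 pens. If 3 packs are bought and 7 pens "
            "are lost, how many pens remain?")
prompt = f"Q: {question}\nA: Let's think step by step."
print(prompt)
```

Few-shot chain-of-thought works the same way but prepends worked examples whose answers already contain intermediate reasoning steps.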