WSD: Warmup-Stable-Decay—a learning rate scheduler with a brief warmup, a long stable phase at a constant high learning rate, and a rapid final decay (annealing)
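The three phases can be sketched as a simple step-dependent function; the parameter names and fractions below (`peak_lr`, `warmup_frac`, `decay_frac`) are illustrative assumptions, not values from any specific training run:

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.05, decay_frac=0.1):
    """Minimal Warmup-Stable-Decay schedule sketch.

    Linear warmup to peak_lr, a long constant 'stable' phase,
    then a rapid linear decay to zero at the end of training.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                      # warmup: ramp up
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:                       # stable: constant high LR
        return peak_lr
    return peak_lr * (total_steps - step) / max(decay_steps, 1)  # decay
```

Unlike cosine schedules, the decay start does not need to be fixed in advance, which is why WSD is popular for continued pre-training.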
Annealing: The final phase of training where the learning rate is decayed to zero, often accompanied by high-quality data to boost performance
SwiGLU: Swish Gated Linear Unit—an activation function that combines the Swish activation with a gating mechanism, known for better performance in LLMs
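A minimal numpy sketch of the SwiGLU feed-forward block, assuming the common formulation (Swish(xW) gated elementwise by xV, then projected by W2); the weight names are illustrative:

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish / SiLU activation: x * sigmoid(beta * x)."""
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

def swiglu(x, W, V, W2):
    """SwiGLU feed-forward sketch: (Swish(x @ W) * (x @ V)) @ W2.

    The second projection x @ V acts as a learned gate on the
    Swish-activated branch.
    """
    return (swish(x @ W) * (x @ V)) @ W2
```

In practice the gated hidden size is often reduced (e.g. by 2/3) so the block's parameter count matches an ungated FFN.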
GQA: Grouped-Query Attention—an attention mechanism that groups query heads to share key-value heads, reducing memory usage and speeding up inference
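The sharing pattern can be sketched with a small single-batch loop (illustrative only; real implementations broadcast the KV heads rather than loop):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """GQA sketch: q has shape (n_q_heads, T, d); k and v have
    shape (n_kv_heads, T, d). Each group of n_q_heads // n_kv_heads
    query heads attends to one shared key-value head, so the KV
    cache holds n_kv_heads heads instead of n_q_heads.
    """
    n_q_heads, T, d = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kh, vh = k[h // group], v[h // group]       # shared KV head
        scores = q[h] @ kh.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ vh
    return out
```

With n_kv_heads = 1 this reduces to multi-query attention; with n_kv_heads = n_q_heads it is standard multi-head attention.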
RoPE: Rotary Positional Embedding—a method for encoding positional information by rotating the query and key vectors by position-dependent angles, making attention scores depend on relative position
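A minimal sketch of the rotation, assuming the common pairing of the first and second halves of each vector and the standard base of 10000:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary embedding to x of shape (T, d), d even.

    Dimension pairs (i, i + d/2) are rotated by an angle that grows
    with position and shrinks with frequency index, so dot products
    between rotated queries and keys depend on relative offsets.
    """
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # (half,)
    angles = np.outer(np.arange(T), freqs)          # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because it is a pure rotation, RoPE preserves vector norms and is applied to queries and keys only, not to values.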
RMSNorm: Root Mean Square Layer Normalization—a normalization technique that simplifies LayerNorm by removing the mean subtraction, improving efficiency
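The simplification relative to LayerNorm is easy to see in code; a minimal sketch:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm sketch: rescale x by the inverse root-mean-square of its
    last axis, then apply a learned gain. Unlike LayerNorm there is no
    mean subtraction and no bias, saving a reduction per token.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```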
TEV: Token Embedding Variability—a metric measuring the variance of the entries within a token's embedding vector, used to detect distribution shifts during training
MFU: Model FLOPs Utilization—the ratio of the achieved floating-point operations per second to the theoretical peak performance of the hardware
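MFU is a ratio of achieved to peak throughput; a minimal sketch using the common ~6N FLOPs-per-token estimate for a forward plus backward pass over a model with N parameters (the function name and example numbers are illustrative):

```python
def model_flops_utilization(tokens_per_sec, n_params, peak_flops):
    """MFU sketch: achieved FLOPs/s divided by the hardware's
    theoretical peak. Uses the standard ~6 * N FLOPs-per-token
    approximation for training (2N forward, 4N backward).
    """
    achieved_flops = 6.0 * n_params * tokens_per_sec
    return achieved_flops / peak_flops
```

For example, a hypothetical 7B-parameter model training at 3,000 tokens/s on hardware with a 312 TFLOP/s peak would land near 40% MFU, a typical range for large-scale training runs.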