MoE: Mixture of Experts—a neural network architecture where different parts of the model (experts) specialize in different tasks or data patterns
Dense MoE: A variation of MoE where all experts are activated and computed for every input, rather than selecting a sparse subset
Sparse MoE: The traditional MoE approach where only a few experts (e.g., top-2) are active per token to save compute
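The dense/sparse distinction above comes down to the router: whether every expert's output is mixed in, or only the top-k. A minimal NumPy sketch (the experts here are toy one-layer networks, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(x, w):
    # Stand-in for an expert network: a single ReLU layer.
    return np.maximum(x @ w, 0.0)

d, n_experts, top_k = 8, 4, 2
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def moe(x, sparse=True):
    # Router produces a softmax weight per expert for this token.
    logits = x @ router
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if sparse:
        # Sparse MoE: zero out all but the top-k experts, renormalize.
        keep = np.argsort(probs)[-top_k:]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    # Dense MoE (sparse=False) evaluates and mixes every expert.
    return sum(p * expert(x, w) for p, w in zip(probs, experts) if p > 0)

x = rng.normal(size=d)
y_sparse = moe(x, sparse=True)   # compute touches only top_k experts
y_dense = moe(x, sparse=False)   # compute touches all n_experts
```

Real systems route per token inside each Transformer layer and add load-balancing losses; this sketch only shows the gating arithmetic.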
SwiGLU: Swish-Gated Linear Unit—a gated feed-forward activation that multiplies a Swish-activated branch with a linear branch, used in modern LLMs such as LLaMA
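The gating described above is easy to write out. A sketch of the SwiGLU feed-forward block (weight shapes are illustrative; LLaMA-style models use three projections like this):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish / SiLU: x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gate branch (Swish-activated) multiplied elementwise with a
    # linear "up" branch, then projected back to model dimension.
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, d_ff = 8, 16
x = rng.normal(size=d)
y = swiglu_ffn(x,
               rng.normal(size=(d, d_ff)),
               rng.normal(size=(d, d_ff)),
               rng.normal(size=(d_ff, d)))
```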
RoPE: Rotary Positional Embeddings—a method for encoding position in Transformers by rotating query/key vectors, which generalizes better to longer sequences because attention depends only on relative offsets
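RoPE's key property is that the dot product between a rotated query and key depends only on their relative distance. A minimal sketch using the common split-half pairing of dimensions:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate pairs of dimensions by a position-dependent angle.
    # Each pair i gets its own frequency base**(-i/half).
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos])

rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)
# Same offset (2) between positions => same attention score.
a = rope(q, 3) @ rope(k, 5)
b = rope(q, 10) @ rope(k, 12)
```

Because only the offset matters, positions beyond those seen in training still produce meaningful scores, which is why RoPE extrapolates better than learned absolute embeddings.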
Self-QA: A data generation method where the model generates its own question-answer pairs from unsupervised text to create instruction-tuning data
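The Self-QA loop can be sketched in a few lines. Here `ask_model` is a stub standing in for a real LLM call, and the prompt wording is an assumption, not the method's exact prompt:

```python
def ask_model(prompt: str) -> str:
    # Stub for an actual LLM call, so the sketch runs standalone.
    return "Q: What is RoPE?\nA: A rotary positional embedding scheme."

def self_qa(passages):
    # For each unlabeled passage, ask the model to write a QA pair,
    # then keep the pairs as instruction-tuning examples.
    pairs = []
    for text in passages:
        reply = ask_model("Read the passage and write one question "
                          f"and its answer.\n\nPassage: {text}")
        q, _, a = reply.partition("\nA: ")
        pairs.append({"question": q.removeprefix("Q: "), "answer": a})
    return pairs

data = self_qa(["RoPE encodes positions by rotating query/key vectors."])
```

Real pipelines add filtering (deduplication, answer-quality checks) before the pairs are used for tuning.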
Hybrid-tuning: A fine-tuning strategy that mixes pre-training data (completion) with instruction data (chat) to prevent catastrophic forgetting
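The mixing itself is simple to sketch. A hypothetical sampler that interleaves the two data types at a chosen ratio (field names and the chat template here are assumptions for illustration):

```python
import random

def hybrid_stream(pretrain_docs, instruction_pairs,
                  chat_ratio=0.5, n=8, seed=0):
    # Sample a training stream that mixes raw-text completion examples
    # with chat-formatted instruction examples, so the model keeps its
    # base language-modeling ability while learning to follow instructions.
    rng = random.Random(seed)
    stream = []
    for _ in range(n):
        if rng.random() < chat_ratio:
            ex = rng.choice(instruction_pairs)
            text = f"User: {ex['prompt']}\nAssistant: {ex['response']}"
            stream.append({"type": "chat", "text": text})
        else:
            stream.append({"type": "completion",
                           "text": rng.choice(pretrain_docs)})
    return stream

batch = hybrid_stream(["Raw pre-training text."],
                      [{"prompt": "Hi", "response": "Hello!"}])
```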
RMSNorm: Root Mean Square Layer Normalization—a normalization technique used to stabilize training in deep networks
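RMSNorm drops LayerNorm's mean subtraction and bias, rescaling only by the root mean square of the activations. A minimal sketch:

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    # Divide by the root-mean-square of the last axis; unlike
    # LayerNorm there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
y = rmsnorm(x, gain=np.ones(4))  # output has unit RMS
```

Skipping the mean statistic makes RMSNorm slightly cheaper than LayerNorm while stabilizing training comparably, which is why LLaMA-family models use it.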