MoE: Mixture of Experts—a neural network architecture where, for each input, only a small subset of specialized sub-networks (experts) is activated, saving computation relative to a dense model with the same total parameter count.
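A minimal sketch of the idea, with toy scalar experts and a softmax router (the expert functions, scores, and k below are made up for illustration, not any specific model's configuration):

```python
import math

def moe_layer(x, experts, router_scores, k=2):
    """Minimal MoE sketch: evaluate only the top-k experts for this input
    and mix their outputs with softmax-normalized router weights."""
    top = sorted(range(len(experts)), key=lambda i: router_scores[i])[-k:]
    exps = [math.exp(router_scores[i]) for i in top]
    z = sum(exps)
    # Only k experts run; the remaining experts cost no compute for this input.
    return sum(w / z * experts[i](x) for i, w in zip(top, exps))

experts = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0)]  # toy experts
y = moe_layer(5.0, experts, router_scores=[0.0, 3.0, 1.0, 2.0], k=2)
```

Because only experts 1 and 3 are selected here, experts 0 and 2 are never called; that skipped work is where the computational saving comes from.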
MLA: Multi-head Latent Attention—an attention mechanism that compresses Keys and Values into a low-rank latent representation, shrinking the KV cache and reducing memory usage during inference.
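The memory saving can be sketched with plain matrix shapes. The dimensions and weight names below are hypothetical and far smaller than a real model's, and this ignores details like RoPE handling; it only shows why caching the latent is cheaper than caching full Keys and Values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 8, 10  # illustrative sizes, not a real config

W_down = rng.normal(size=(d_model, d_latent))  # compress the hidden state
W_up_k = rng.normal(size=(d_latent, d_model))  # reconstruct Keys
W_up_v = rng.normal(size=(d_latent, d_model))  # reconstruct Values

h = rng.normal(size=(seq, d_model))  # hidden states for 10 tokens
latent = h @ W_down   # shape (10, 8): this small tensor is what gets cached
k = latent @ W_up_k   # Keys recomputed from the latent at attention time
v = latent @ W_up_v   # Values likewise
```

Instead of storing two `(10, 64)` tensors per layer, the cache holds one `(10, 8)` latent, an 8x reduction at these toy sizes.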
RoPE: Rotary Position Embeddings—a method to encode token position information by rotating query and key vectors, allowing better length generalization.
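The rotation can be sketched in a few lines: consecutive pairs of vector components are rotated by a position-dependent angle (the base `theta=10000.0` is the common choice, but the vector values here are illustrative):

```python
import math

def rope_rotate(vec, position, theta=10000.0):
    """Rotate consecutive pairs of a query/key vector by position-dependent
    angles, as in rotary position embeddings."""
    out = []
    d = len(vec)
    for i in range(0, d, 2):
        angle = position * theta ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

# Position 0 leaves the vector unchanged; later positions rotate it.
print(rope_rotate([1.0, 0.0, 1.0, 0.0], position=0))  # → [1.0, 0.0, 1.0, 0.0]
```

Because each pair undergoes a plane rotation, the dot product between a rotated query at position m and a rotated key at position n depends only on m − n; that relative-position property is what helps length generalization.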
KV cache: Key-Value cache—storing the attention keys and values computed for previous tokens during text generation so they need not be recomputed for every new token.
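A toy sketch of the caching pattern (the class and the stand-in vectors are made up; real caches hold per-layer, per-head tensors):

```python
class KVCache:
    """Toy per-layer KV cache: keys/values for past tokens are stored once
    and reused, so each generation step computes only the new token's K/V."""
    def __init__(self):
        self.keys = []    # one entry per generated token
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        return self.keys, self.values  # attention reads the full history

cache = KVCache()
for step in range(3):
    k, v = [float(step)], [float(step) * 2]  # stand-ins for real K/V vectors
    keys, values = cache.append(k, v)
print(len(cache.keys))  # → 3
```

Without the cache, step t would recompute keys and values for all t previous tokens; with it, each step does only O(1) new K/V work (attention itself still reads the whole history).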
FLOPs: Floating Point Operations—a measure of computational cost.
auxiliary-loss-free load balancing: A method to ensure all experts in an MoE are used roughly equally without adding a separate loss term that might conflict with the main training objective.
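One way such balancing can work, sketched here with made-up names and a made-up step size, is to keep a per-expert bias that is added to routing scores only when selecting experts, nudged after each batch according to observed load rather than by a gradient:

```python
def update_routing_biases(biases, token_counts, gamma=0.001):
    """Hypothetical sketch: lower the selection bias of overloaded experts
    and raise it for underloaded ones, steering future top-k routing toward
    balance without a gradient-based auxiliary loss."""
    target = sum(token_counts) / len(token_counts)  # ideal per-expert load
    return [b - gamma if n > target else b + gamma
            for b, n in zip(biases, token_counts)]

biases = update_routing_biases([0.0, 0.0, 0.0], token_counts=[90, 5, 5])
# The overloaded first expert's bias drops; the underloaded ones rise.
```

Since the bias only influences which experts are picked, not the weights applied to their outputs, it rebalances load without distorting the main training objective the way an added loss term could.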
top-k routing: Selecting only the k highest-scoring experts to process a given token.
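The selection itself is a one-liner; this minimal illustration works on a plain list of scores, whereas real routers apply the equivalent over batched logits with a framework op:

```python
import heapq

def top_k(scores, k):
    """Indices of the k highest routing scores, highest first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

print(top_k([0.1, 2.0, -1.0, 1.5], k=2))  # → [1, 3]
```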
perplexity: A metric measuring how well a probability model predicts a sample, computed as the exponential of the average negative log-likelihood per token; lower is better.
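The formula is short enough to show directly; the token probabilities below are invented for illustration:

```python
import math

def perplexity(probs):
    """Perplexity = exp of the average negative log-likelihood the model
    assigns to each observed token."""
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity 4,
# i.e. it is as uncertain as a uniform choice among 4 tokens:
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # → 4.0
```

This "effective branching factor" reading is why lower is better: a perfect model (probability 1 on every observed token) would have perplexity 1.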