MoE: Mixture-of-Experts—a model architecture where only a subset of parameters (experts) are activated for each token, improving efficiency.
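A minimal sketch of this sparse activation with toy top-k routing; the function names, shapes, and gating details are illustrative, not any specific model's implementation:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through the top-k experts of a toy MoE layer.

    x: (d,) token activation; gate_w: (d, n_experts) router weights;
    experts: list of callables mapping (d,) -> (d,). All hypothetical names.
    """
    logits = x @ gate_w                       # one router score per expert
    topk = np.argsort(logits)[-k:]            # indices of the k best-scoring experts
    w = np.exp(logits[topk] - logits[topk].max())
    w /= w.sum()                              # softmax over the selected experts only
    # only k of the n experts are ever evaluated -> sparse activation
    return sum(wi * experts[i](x) for wi, i in zip(w, topk))

rng = np.random.default_rng(0)
d, n = 4, 8
experts = [(lambda W: (lambda x: W @ x))(rng.standard_normal((d, d))) for _ in range(n)]
y = moe_forward(rng.standard_normal(d), rng.standard_normal((d, n)), experts)
```

The compute cost scales with the k activated experts, not with the total parameter count.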
Muon: A momentum-based optimizer that orthogonalizes its weight-update matrices (via Newton–Schulz iteration), designed to be more token-efficient than AdamW.
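The orthogonalization step can be sketched with a Newton–Schulz iteration; the quintic coefficients below come from the public Muon reference implementation and should be treated as an assumption here:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately map G to the U V^T factor of its SVD (Muon's core step).

    A sketch: the (a, b, c) quintic coefficients are the tuned values from
    the public Muon reference code, assumed here rather than derived.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds the spectral norm,
                                          # so all singular values start in (0, 1]
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # pushes singular values toward 1
    return X.T if transposed else X

O = newton_schulz_orthogonalize(np.random.default_rng(1).standard_normal((3, 5)))
```

After a few iterations the singular values of `O` cluster near 1, so the update has roughly uniform scale in every direction.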
MLA: Multi-Head Latent Attention—an attention mechanism that compresses keys and values into a shared low-rank latent vector, shrinking the KV cache during inference.
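A toy sketch of the latent KV path, assuming a simple down-/up-projection parameterization (the full MLA factorization, e.g. decoupled RoPE keys, is omitted):

```python
import numpy as np

def mla_kv(x, W_down, W_uk, W_uv):
    """MLA's KV path, sketched: compress the token into a small latent c,
    cache only c, and reconstruct keys/values from it when attending.
    Shapes and weight names here are illustrative."""
    c = x @ W_down    # (d_latent,) -- this is all that enters the KV cache
    k = c @ W_uk      # up-project to the key dimension on demand
    v = c @ W_uv      # up-project to the value dimension on demand
    return c, k, v

rng = np.random.default_rng(0)
d, d_latent, d_head = 8, 2, 4
c, k, v = mla_kv(rng.standard_normal(d),
                 rng.standard_normal((d, d_latent)),
                 rng.standard_normal((d_latent, d_head)),
                 rng.standard_normal((d_latent, d_head)))
```

The memory saving comes from caching the small `c` per token instead of full per-head keys and values.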
QK-Clip: A technique proposed in this paper that rescales Query and Key weights post-update if attention logits exceed a threshold, preventing training instability.
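The rescaling step can be sketched as follows; per-head logit tracking and the threshold value are simplified assumptions:

```python
import numpy as np

def qk_clip(W_q, W_k, max_logit, tau=100.0):
    """Post-update QK-Clip sketch for one head.

    An attention logit is a q.k product, bilinear in W_q and W_k, so scaling
    both by sqrt(tau / max_logit) brings the largest observed logit back to
    tau. Per-head bookkeeping of max_logit is omitted for brevity.
    """
    if max_logit > tau:
        gamma = np.sqrt(tau / max_logit)
        W_q = W_q * gamma
        W_k = W_k * gamma
    return W_q, W_k

Wq, Wk = qk_clip(np.ones((2, 2)), np.ones((2, 2)), max_logit=400.0)
```

Scaling both projections (rather than just one) splits the correction symmetrically between queries and keys.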
RLVR: Reinforcement Learning with Verifiable Rewards—training models using tasks where the outcome can be programmatically checked (e.g., code execution, math answers).
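The defining ingredient is a reward computed by a program rather than by a learned model; a toy checker, where the answer-extraction rule (last whitespace-separated token) is a deliberately naive assumption:

```python
def verifiable_reward(completion: str, expected: str) -> float:
    """Binary reward from a programmatic check, RLVR's core idea:
    1.0 if the model's final answer matches the reference, else 0.0."""
    answer = completion.strip().split()[-1]
    return 1.0 if answer == expected else 0.0
```

Real pipelines replace the string match with a code-execution sandbox or a math-equivalence checker, but the reward stays a deterministic program.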
Agentic Intelligence: The capability of an AI to perceive, plan, reason, and act autonomously in dynamic environments using tools.
Sparsity: In MoE, the ratio of the total number of experts to the number activated per token; higher sparsity means a smaller fraction of the model's parameters is used for each token.
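As a worked example of the ratio (the expert counts below are illustrative):

```python
def moe_sparsity(total_experts: int, active_experts: int) -> float:
    """Sparsity as defined above: total experts / activated experts."""
    return total_experts / active_experts

# e.g. 384 experts with 8 activated per token gives sparsity 48
```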
1F1B: One-Forward-One-Backward—a pipeline parallelism schedule that interleaves forward and backward passes so each stage holds activations for at most a pipeline-depth's worth of microbatches, reducing peak memory compared with running all forwards before any backwards.
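A sketch of the per-stage operation order under a simple synchronous 1F1B schedule (details vary across frameworks); the invariant to notice is that in-flight forward activations never exceed the pipeline depth:

```python
def one_f_one_b(stage: int, num_stages: int, num_microbatches: int):
    """Ops executed by one pipeline stage under 1F1B, as (op, microbatch) pairs.

    Warmup forwards fill the pipeline, then forwards and backwards strictly
    alternate, then the remaining backwards drain. In-flight activations stay
    bounded by num_stages rather than num_microbatches, which is the memory
    win over all-forward-then-all-backward (GPipe-style) schedules.
    """
    warmup = min(num_stages - 1 - stage, num_microbatches)
    ops, f, b = [], 0, 0
    for _ in range(warmup):                      # fill the pipeline
        ops.append(("F", f)); f += 1
    for _ in range(num_microbatches - warmup):   # steady state: 1F then 1B
        ops.append(("F", f)); f += 1
        ops.append(("B", b)); b += 1
    for _ in range(warmup):                      # drain remaining backwards
        ops.append(("B", b)); b += 1
    return ops

ops = one_f_one_b(stage=0, num_stages=4, num_microbatches=8)
```

The first stage does the most warmup (it is furthest from the loss), so its peak of stored activations equals the pipeline depth.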