MoE: Mixture-of-Experts, a model architecture in which a router activates only a subset of sub-networks (experts) for each input, so total parameter capacity grows without a proportional increase in per-token inference compute.
MTP: Multi-Token Prediction—a training objective where the model predicts multiple future tokens simultaneously to improve reasoning and enable speculative decoding.
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt, removing the need for a separate value network.
SFT: Supervised Fine-Tuning—training a model on labeled examples to teach it specific behaviors or formats.
CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer.
ARC: Agentic, Reasoning, and Coding—the three core capabilities targeted by this model family.
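Muon optimizer: An optimizer for the matrix-shaped weights of neural networks that orthogonalizes each momentum update (typically with a Newton-Schulz iteration), designed to accelerate convergence and handle large batch sizes efficiently.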
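RoPE: Rotary Positional Embeddings, a method for encoding position in transformer models by rotating query and key vectors through position-dependent angles, so that attention scores depend on the relative offset between tokens.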
Self-distillation: A process where a stronger version of a model (e.g., trained via RL) generates data to train a new base version of itself.
Pareto Frontier: The set of optimal solutions where no objective can be improved without sacrificing another; here referring to the trade-off between model size and performance.
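
Example (GRPO): the within-group reward normalization described in the GRPO entry can be sketched as follows. This is a minimal illustration, not the implementation used for this model family. The function name group_relative_advantages, the eps stabilizer, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same prompt.

    Each output's advantage is its reward minus the group mean, divided by the
    group standard deviation. The group itself acts as the baseline, so no
    separate value network is needed. `eps` (an assumption of this sketch)
    avoids division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally plausible choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Outputs that score above the group mean receive positive advantages and those below receive negative ones. If every output in a group earns the same reward, all advantages are zero and that group contributes no learning signal.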