Mamba2: A state-space model architecture that processes sequences with linear complexity, offering efficiency advantages over standard attention mechanisms, whose cost grows quadratically with sequence length
MoE: Mixture of Experts—a neural network architecture where different parts (experts) are activated for different inputs, allowing huge total parameter counts with low inference cost
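The MoE idea above can be sketched in a few lines. This is a minimal, hypothetical top-k routing example (all names and shapes are illustrative, not from any particular model): a router scores all experts per input, but only the k highest-scoring experts actually run, so compute scales with k rather than with the total expert count.

```python
import numpy as np

rng = np.random.default_rng(0)
E, k, d = 8, 2, 16  # total experts, experts active per token, hidden dim (illustrative)

experts = [rng.standard_normal((d, d)) * 0.02 for _ in range(E)]  # one weight matrix per expert
router = rng.standard_normal((d, E)) * 0.02                       # gating projection

def moe_layer(x):
    logits = x @ router                # router score for every expert
    top = np.argsort(logits)[-k:]      # select only the k best experts
    gates = np.exp(logits[top])
    gates /= gates.sum()               # normalize gates over the selected experts
    # weighted sum of the selected experts' outputs; the other E-k experts cost nothing
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_layer(rng.standard_normal(d))
print(y.shape)
```

The model's total parameter count grows with E, but each token's inference cost only grows with k.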
SSM: State Space Model—a mathematical framework for modeling sequence data, used here via Mamba layers to handle long contexts efficiently
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that computes advantages from rewards relative to a group of sampled responses, removing the need for a separate per-token value model
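The core of the group-relative trick can be shown directly. This is a hedged sketch (the reward values are made up): for one prompt, several responses are sampled and scored, and each response's advantage is its reward normalized against the group's mean and standard deviation, so no learned value model is needed.

```python
import numpy as np

# Rewards for a group of 4 sampled responses to the same prompt (illustrative values)
rewards = np.array([0.2, 0.9, 0.5, 0.4])

# Group-relative advantage: deviation from the group mean, scaled by the group std
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages.round(2))
```

Responses above the group average get positive advantages and are reinforced; below-average ones are pushed down.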
Chain-of-Thought (CoT): A prompting or training technique where the model generates intermediate reasoning steps before the final answer
KV Cache: Key-Value Cache—memory used during Transformer inference to store previously computed key and value tensors so they are not recomputed at each step; reducing this is crucial for long-context efficiency
GQA: Grouped-Query Attention—an attention mechanism that shares Key and Value heads across multiple Query heads to reduce memory usage
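The memory saving from GQA follows directly from the KV cache size formula. A back-of-the-envelope sketch, with illustrative shapes (32 layers, 8K context, 128-dim heads, fp16): sharing 32 query heads across 8 KV head groups shrinks the cache by the ratio of query heads to KV heads.

```python
def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values, stored per layer per position (fp16 = 2 bytes/elem)
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_elem

layers, seq, head_dim = 32, 8192, 128           # illustrative model shapes
mha = kv_cache_bytes(layers, seq, kv_heads=32, head_dim=head_dim)  # full multi-head: 1 KV head per query head
gqa = kv_cache_bytes(layers, seq, kv_heads=8, head_dim=head_dim)   # GQA: 8 shared KV groups
print(mha // gqa)  # → 4 (cache is 4x smaller)
```

The saving is exactly num_query_heads / num_kv_groups, independent of sequence length and layer count.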
Deliberation Learning: An iterative training process where a model improves by generating candidates, having them critiqued (by judges/humans), and fine-tuning on the best outcomes