MoE: Mixture-of-Experts—a neural network architecture where only a subset of parameters (experts) is activated for each token, reducing per-token compute relative to activating all parameters
CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training the model on labeled instruction-response pairs
RL: Reinforcement Learning—a training method where the model learns to maximize a reward signal (e.g., answer correctness)
RoPE: Rotary Positional Embedding—a method for encoding positional information in transformers by rotating query and key vectors
SigLIP: Sigmoid Loss for Language Image Pre-training—a contrastive learning method for aligning image and text representations
NaViT: Native Resolution Vision Transformer—a technique to process images of arbitrary resolutions by packing patches into a sequence without padding
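ZeRO: Zero Redundancy Optimizer—a memory optimization technique for distributed training that partitions optimizer states, gradients, and parameters across data-parallel workers instead of replicating them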
Muon: An optimizer that orthogonalizes momentum-based updates of weight matrices (using a Newton-Schulz iteration), designed for efficient large-scale training
NTP: Next Token Prediction—the standard training objective for language models
NIAH: Needle-In-A-Haystack—an evaluation measuring a model's ability to retrieve specific information from a long context window
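
As an illustration of the RoPE entry above, the following is a minimal pure-Python sketch of rotating a query or key vector by its position. The function names `rope_rotate` and `dot`, the interleaved pairing of dimensions, and the base of 10000 are assumptions made for this example; real implementations may pair dimensions differently or use another base.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d).

    Assumes interleaved pairs and an even vector length (illustrative choices).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Lower dimensions rotate quickly, higher dimensions slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same relative offset (2 positions apart) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
```

Because both vectors are rotated, the two scores agree up to floating-point error: the query-key dot product depends only on the relative offset between positions, not on their absolute values.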