MoE: Mixture-of-Experts—a neural network architecture where different subsets of the model (experts) are activated for different inputs
Sparsity (S): The ratio of inactive experts to the total number of experts; higher sparsity means a smaller fraction of the model is used per token
IsoFLOP: A curve or surface representing constant computational cost (FLOPs), used to find the optimal model configuration at a fixed compute budget
Active Parameters (N_a): The number of parameters actually used to process a single token; determines inference cost and FLOPs per example
Total Parameters (N): The sum of all weights in the model, including those not activated for a given token; determines memory usage
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps, effectively increasing compute per example during inference
Upstream Performance: Performance on the pretraining objective (usually next-token prediction loss or perplexity)
Downstream Performance: Performance on specific tasks (e.g., QA, reasoning) often measured via few-shot prompting after pretraining
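
The relationship between sparsity (S), total parameters (N), and active parameters (N_a) in the definitions above can be made concrete with a small sketch for a single top-k MoE layer. The function name and all numeric values below are hypothetical, chosen only for illustration:

```python
def moe_layer_stats(num_experts: int, top_k: int, params_per_expert: int):
    """Return (N, N_a, S) for one equal-sized-expert MoE layer with top-k routing."""
    N = num_experts * params_per_expert      # total parameters -> memory footprint
    N_a = top_k * params_per_expert          # active parameters per token -> FLOPs
    S = (num_experts - top_k) / num_experts  # sparsity: inactive experts / total experts
    return N, N_a, S

# Hypothetical layer: 64 experts, 2 routed per token, 1M parameters each
N, N_a, S = moe_layer_stats(num_experts=64, top_k=2, params_per_expert=1_000_000)
print(N, N_a, S)  # 64000000 2000000 0.96875
```

Note that this counts only expert parameters in a single layer; in a full model, shared components (attention, embeddings) are always active and contribute to both N and N_a.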