MoE: Mixture-of-Experts—a neural architecture where different parts of the network (experts) activate for different inputs to scale parameters without scaling compute
FFN: Feed-Forward Network—the dense layers within a Transformer block where factual knowledge is hypothesized to be stored
FLOPs: Floating Point Operations—a measure of computational cost
SPMD: Single Program, Multiple Data; a parallel programming model in which every device runs the same program on its own shard of the data, used to train large models across many devices
Zipfian distribution: A distribution where a few items (words) occur very frequently while most occur rarely, creating load-balancing challenges for word-specific experts
routing vocabulary: A specialized auxiliary vocabulary (distinct from the tokenizer vocabulary) used solely to determine which expert handles a token
Exact Match (EM): A metric measuring the percentage of predictions that match the ground-truth answer exactly
SuperGLUE: A benchmark suite of difficult language understanding tasks
T5: Text-to-Text Transfer Transformer—a widely used encoder-decoder language model
knowledge-rich vocabulary: A vocabulary constructed from Wikidata entities and relations, prioritized by frequency, to ensure experts specialize in semantic concepts
inference latency: The time it takes for a model to generate a response
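
The load-balancing problem named in the Zipfian distribution entry can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the source: the helper names `zipf_frequencies` and `expert_loads`, the exponent of 1.0, the vocabulary size of 1000, the 8 experts, and the round-robin word-to-expert assignment are all assumptions chosen for the example.

```python
def zipf_frequencies(vocab_size, exponent=1.0):
    """Return normalized frequencies proportional to 1 / rank**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expert_loads(frequencies, num_experts):
    """Sum the share of token traffic that each expert receives.

    Assumption for illustration only: the word at index i (rank i + 1)
    is routed to expert i % num_experts.
    """
    loads = [0.0] * num_experts
    for index, freq in enumerate(frequencies):
        loads[index % num_experts] += freq
    return loads


# Hypothetical sizes, chosen only to make the skew visible.
freqs = zipf_frequencies(vocab_size=1000)
loads = expert_loads(freqs, num_experts=8)
for expert_id, load in enumerate(loads):
    print(f"expert {expert_id}: {load:.3f} of all tokens")
```

Even though every expert owns the same number of words here, the expert that holds the most frequent word receives a disproportionate share of the tokens, well above the uniform share of 1/8. This is the kind of skew that word-specific experts have to contend with.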