MoE: Mixture-of-Experts—a neural network architecture where different parts of the network (experts) are activated for different inputs
Flan: Finetuned Language Net—a methodology for instruction-tuning language models on a large collection of tasks
FLOPs: Floating Point Operations—a measure of computational cost
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects like math, history, and law
BBH: BIG-Bench Hard—a subset of 23 challenging tasks from the BIG-Bench benchmark
CoT: Chain-of-Thought—a prompting method where the model generates reasoning steps before the final answer
routing strategy: The mechanism determining which expert processes a given token (e.g., token-choice vs. expert-choice)
ST-MoE: Stable and Transferable Mixture-of-Experts—a specific sparse MoE architecture used as a base
expert-choice: A routing strategy where experts select the top-k tokens they want to process, ensuring balanced load
token-choice: A routing strategy where each token selects the top-k experts to process it
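The contrast between the two routing strategies above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular MoE implementation: the affinity scores, the `k`/`capacity` values, and the function names are all illustrative assumptions.

```python
# Illustrative sketch of token-choice vs. expert-choice routing.
# Scores, capacities, and function names are assumptions for this example,
# not drawn from any specific MoE library.

def token_choice(scores, k):
    """Each token (row) selects its top-k experts by affinity score.
    Load across experts is NOT guaranteed to be balanced."""
    routes = {}
    for t, row in enumerate(scores):
        top = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        routes[t] = top
    return routes

def expert_choice(scores, capacity):
    """Each expert (column) selects its top-`capacity` tokens,
    so every expert processes exactly `capacity` tokens (balanced load)."""
    n_experts = len(scores[0])
    routes = {}
    for e in range(n_experts):
        col = [scores[t][e] for t in range(len(scores))]
        top = sorted(range(len(col)), key=lambda t: col[t], reverse=True)[:capacity]
        routes[e] = top
    return routes

# Toy token-to-expert affinity matrix: 4 tokens x 2 experts.
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.4, 0.6],
]

tc = token_choice(scores, k=1)          # tokens 0-2 all pick expert 0: unbalanced
ec = expert_choice(scores, capacity=2)  # each expert takes exactly 2 tokens
```

With these toy scores, token-choice sends three of four tokens to expert 0, while expert-choice caps each expert at two tokens—the load-balancing property noted in the expert-choice entry above.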