MoE: Mixture of Experts—a neural network architecture where different subsets of parameters (experts) are activated for different inputs
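The routing idea behind MoE can be sketched in a few lines. This is a toy illustration, not any specific model's implementation: `moe_layer`, `gate_w`, and the expert matrices are all hypothetical names, and real MoE layers use learned gating networks and full FFN experts rather than single matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, experts, gate_w, k=2):
    """Toy top-k MoE: route input x to the k highest-scoring experts."""
    scores = gate_w @ x                      # one gating score per expert
    top = np.argsort(scores)[-k:]            # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                 # softmax over the selected experts
    # only the chosen experts' parameters participate for this input
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

d, n_experts = 4, 8
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
gate_w = rng.standard_normal((n_experts, d))
y = moe_layer(rng.standard_normal(d), experts, gate_w, k=2)
```

With `k=2` of 8 experts active, only a quarter of the expert parameters are touched per input, which is the source of MoE's compute savings.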
Latent Space: A compressed, lower-dimensional representation of data where essential features are preserved
SVD: Singular Value Decomposition—a factorization of a matrix into the product UΣVᵀ; truncating the smallest singular values yields a lower-rank approximation with fewer parameters
FFN: Feed-Forward Network—the fully connected layers within a Transformer block where MoE is typically applied
Rank: The dimension of the vector space spanned by the columns of a matrix; lowering rank reduces the number of independent parameters needed to define the matrix
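The SVD and rank entries above combine into the standard low-rank compression trick. A minimal NumPy sketch (the matrix sizes and rank `r` here are illustrative, not taken from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))       # a dense weight matrix: 64*64 = 4096 params

# Full SVD: W = U @ diag(s) @ Vt, with s sorted in descending order
U, s, Vt = np.linalg.svd(W, full_matrices=False)

r = 8                                   # keep only the top-r singular values
A = U[:, :r] * s[:r]                    # 64 x r factor (singular values folded in)
B = Vt[:r, :]                           # r x 64 factor
W_approx = A @ B                        # best rank-r approximation of W

# Parameter count drops from 64*64 = 4096 to 2*64*r = 1024
print(W.size, A.size + B.size)
```

Storing the two thin factors `A` and `B` instead of `W` is what "approximating a matrix with lower rank (fewer parameters)" means in practice.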
GSM8K: A benchmark dataset of grade school math word problems used to evaluate reasoning capabilities
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects like math, history, and law
WikiText-2: A language modeling benchmark used to evaluate the perplexity (predictive uncertainty) of a model
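Perplexity, mentioned in the WikiText-2 entry, is the exponential of the average negative log-likelihood the model assigns to the true next tokens. A small worked example with made-up probabilities:

```python
import numpy as np

# Hypothetical probabilities a model assigned to each correct next token
probs = np.array([0.5, 0.25, 0.1, 0.8])

nll = -np.log(probs)            # negative log-likelihood per token
ppl = np.exp(nll.mean())        # perplexity = exp(mean NLL)

# Equivalently: the reciprocal of the geometric mean of the probabilities.
# Lower perplexity means the model is less "surprised" by the text.
print(ppl)
```

A perplexity of about 3.16 here means the model was, on average, as uncertain as if it were choosing uniformly among roughly three tokens at each step.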