X-ELM: Cross-lingual Expert Language Models—the proposed ensemble of independently trained multilingual expert models
BTM: Branch-Train-Merge—a training paradigm where a model branches into independent experts that train in parallel and merge predictions at inference
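One simple way BTM-style experts can merge predictions at inference is as a weighted mixture of each expert's next-token distribution. The sketch below is illustrative, not the paper's exact procedure: the uniform weights and the tiny three-token vocabulary are assumptions for the example.

```python
# Hedged sketch: merging expert next-token distributions as a weighted
# mixture (one way a BTM-style ensemble can combine predictions).
# Weights here are uniform and hypothetical; in practice they might be
# derived from how well each expert matches the input.

def merge_expert_probs(expert_probs, weights):
    """Combine per-expert probability distributions into one distribution."""
    assert len(expert_probs) == len(weights)
    vocab_size = len(expert_probs[0])
    merged = [0.0] * vocab_size
    for probs, w in zip(expert_probs, weights):
        for i, p in enumerate(probs):
            merged[i] += w * p
    return merged

# Two toy experts over a 3-token vocabulary, weighted uniformly.
expert_a = [0.7, 0.2, 0.1]
expert_b = [0.1, 0.6, 0.3]
merged = merge_expert_probs([expert_a, expert_b], [0.5, 0.5])
print(merged)
```

Because each expert trains independently, only this cheap mixing step couples them, which is what lets training scale in parallel.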
HMR: Hierarchical Multi-Round training—a method for training new experts by initializing each from the most typologically similar existing expert (e.g., the expert for its parent node in a language-family hierarchy)
TF-IDF clustering: Grouping text data based on overlapping vocabulary frequency (Term Frequency-Inverse Document Frequency)
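A minimal sketch of TF-IDF clustering, assuming scikit-learn is available. The toy documents are invented for illustration; the idea is that documents sharing vocabulary (here, English vs. Spanish sentences) end up in the same cluster.

```python
# Hedged sketch: cluster documents by TF-IDF vocabulary overlap.
# The documents and k=2 are assumptions for this toy example.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "a cat and a dog sat together",
    "el gato se sienta en la alfombra",
    "el perro y el gato juegan",
]

# Weight each term by how frequent it is in a document and how rare it
# is across documents, then group documents with similar term profiles.
tfidf = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
print(labels)
```

Because the English and Spanish sentences share almost no vocabulary, the two language groups separate cleanly, which is why vocabulary-based clustering often recovers language or domain boundaries in unlabeled corpora.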
Typological clustering: Grouping languages based on linguistic features (syntax, phonology) using databases like WALS
LAPT: Language-Adaptive Pretraining—continuing to pretrain a model on a specific target language to improve performance
curse of multilinguality: The phenomenon where adding more languages to a fixed-capacity model degrades performance on individual languages due to parameter competition
mC4: Multilingual Colossal Clean Crawled Corpus—a massive multilingual dataset used for pretraining
perplexity: A metric measuring how well a probability model predicts a sample; lower is better
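Perplexity is the exponentiated average negative log-likelihood of the observed tokens. A short sketch, assuming the model's per-token probabilities are already in hand:

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability the model assigned to each
    observed token: exp of the mean negative log-likelihood."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity ~4:
# it is as uncertain as a uniform choice among four options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# A sharper model (probability 0.5 per token) scores lower (~2), i.e. better.
print(perplexity([0.5, 0.5, 0.5, 0.5]))
```

This is why "lower is better": a smaller perplexity means the model spread its probability mass over fewer plausible alternatives per token.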