Polysemy: The capacity for a word or phrase to have multiple meanings (e.g., 'Makloube', which can name a dish or mean 'flipped')
Transliteration: Representing a word from one language using the script of another (e.g., writing 'Lasagna' in Arabic script)
CBS (Cultural Bias Score): A metric measuring an LM's likelihood preference for Western over Arab entities in a neutral or Arab-specific context
mC4: Multilingual Colossal Clean Crawled Corpus, a web-crawled dataset covering roughly 100 languages that is used for pre-training multilingual language models
Script sharing: When multiple languages (e.g., Arabic, Farsi, Urdu) use the same writing system, causing lexical overlap
NER: Named Entity Recognition, the task of locating entity mentions in text and classifying them into categories (people, places, organizations)
Extractive QA: Question Answering where the model must extract the answer as a span of text from the provided context
Text-infilling: A task where the model predicts missing words (masked tokens) in a sentence
Frequency-based tokenization: Algorithms such as BPE that build a vocabulary by assigning dedicated tokens to frequent character sequences; as a result, a polysemous word can end up as a single token shared by all of its senses
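
To make the frequency-based tokenization entry concrete, here is a minimal sketch of BPE-style merge learning in Python. It is a toy illustration, not the tokenizer of any particular model: the function name learn_bpe, the Latin-script words, and their counts are all hypothetical. It shows how a frequent word collapses into one token while a rare word stays split, and why that single token carries no information about which sense of a polysemous word is intended.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: count} dict (toy sketch)."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word, fusing every occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical toy corpus: counts are invented for illustration only.
toy_counts = {"flipped": 10, "flip": 3, "lid": 1}
merges, vocab = learn_bpe(toy_counts, num_merges=6)

for symbols, freq in vocab.items():
    print(freq, symbols)
```

With these toy counts, six merges are enough for the frequent word "flipped" to become a single symbol, while the rare word "lid" remains split into two pieces. The merged token is the same whether the word refers to the dish or to the action, which is the polysemy concern the glossary entry describes.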