mPLM: Multilingual Pre-trained Language Model—a model trained on text in many languages to learn shared representations
Script barrier: The performance gap that arises when languages use different writing systems (scripts), hindering effective cross-lingual knowledge transfer between them
Transliteration: Converting text from one script to another (e.g., Cyrillic to Latin) based on phonetic similarity, without translating the meaning
Uroman: A universal romanizer tool that converts text from almost any script into Latin characters
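To make the idea concrete, here is a toy sketch of script-only transliteration: a small phonetic character map from Cyrillic to Latin. This is illustrative only and is not the real Uroman tool (which covers nearly every script with far richer rules); the mapping table below is a hypothetical example.

```python
# Toy Cyrillic-to-Latin transliteration: each character is mapped by
# sound, so only the script changes -- the meaning is never translated.
# (Hypothetical mini-mapping for illustration; not Uroman's actual rules.)
CYR2LAT = {
    "п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",
    "м": "m", "о": "o", "с": "s", "к": "k", "а": "a",
}

def transliterate(text: str) -> str:
    # Characters outside the map are passed through unchanged.
    return "".join(CYR2LAT.get(ch, ch) for ch in text.lower())

print(transliterate("привет"))  # -> "privet" (the meaning, "hello", is untouched)
```

A universal romanizer such as Uroman applies the same principle at scale, so that text from any script lands in a shared Latin-character space.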
TLM: Translation Language Modeling—an objective usually applied to parallel translation pairs, adapted here for transliteration pairs to align tokens
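The TLM adaptation can be sketched as follows: concatenate a sentence with its transliteration and mask tokens on both sides, so a masked token on one side can often be recovered from its counterpart on the other. This is a minimal sketch under assumed token/mask conventions (`[SEP]`, `[MASK]`, 15% masking), not the paper's exact preprocessing.

```python
import random

def tlm_example(src_tokens, translit_tokens, mask_token="[MASK]", p=0.15, seed=0):
    # Concatenate the transliteration pair into one input sequence.
    # Masked positions on one side have an unmasked counterpart on the
    # other side, which encourages cross-script token alignment.
    rng = random.Random(seed)
    pair = src_tokens + ["[SEP]"] + translit_tokens
    inputs, labels = list(pair), [None] * len(pair)
    for i, tok in enumerate(pair):
        if tok != "[SEP]" and rng.random() < p:
            labels[i] = tok          # the model must predict the original token
            inputs[i] = mask_token
    return inputs, labels
```

The only change from standard TLM is the source of the pair: a transliteration of the same sentence rather than a translation into another language.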
SimCSE: Simple Contrastive Learning of Sentence Embeddings—a framework for learning sentence vectors by pulling similar sentences together and pushing others apart
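The pull-together/push-apart objective is an InfoNCE-style contrastive loss over cosine similarities. Below is a minimal NumPy sketch, assuming a batch of paired sentence embeddings where row i of `a` and row i of `b` are the positive pair; the temperature value is a common default, not one taken from the source.

```python
import numpy as np

def simcse_loss(a, b, temperature=0.05):
    # L2-normalize so dot products become cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a @ b.T / temperature  # (batch, batch): row i vs. all of b
    # Cross-entropy with the matching pair (the diagonal) as the target:
    # pulls positives together, pushes all in-batch negatives apart.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(sim)))
```

When the paired embeddings are close, the diagonal dominates each row and the loss approaches zero; mismatched pairs drive it up.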
Glot500: A massively multilingual model pre-trained on over 500 languages, used here as the base model