LEM: Linguistic Entity Masking—the proposed strategy of masking single tokens from Named Entities, Nouns, and Verbs.
MLM: Masked Language Modeling—a pre-training objective where random tokens in a sentence are masked and the model must predict them.
TLM: Translation Language Modeling—an extension of MLM using concatenated parallel sentences, allowing the model to attend to the translation context.
Bitext mining: The task of automatically finding parallel sentence pairs (translations) from two large monolingual corpora.
Code-mixed: Text that alternates between two or more languages within the same sentence or utterance.
ChrF: Character n-gram F-score—an automatic evaluation metric for machine translation that scores character n-gram overlap between a system output and a reference translation; it correlates well with human judgment, especially for morphologically rich languages.
LRL: Low-Resource Language—a language with limited available training data (text or parallel corpora).
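
To make the difference between the MLM and LEM entries concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tag names (`NE`, `NOUN`, `VERB`), the span format, the `[MASK]` string, and the 15% default masking rate are all hypothetical choices. It also assumes that every eligible span contributes exactly one masked token, which the definitions above do not specify, and it leaves out details of real MLM pipelines such as subword tokenization.

```python
import random

MASK = "[MASK]"                            # assumed mask symbol
LINGUISTIC_TAGS = {"NE", "NOUN", "VERB"}   # assumed tag names for LEM targets


def mlm_mask(tokens, mask_prob=0.15, rng=None):
    """Plain MLM sketch: mask each token independently with probability mask_prob.

    Returns (masked_tokens, labels); labels[i] holds the original token at
    masked positions and None elsewhere.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def lem_mask(tokens, spans, rng=None):
    """LEM-style sketch: mask one token inside each named-entity, noun, or verb span.

    spans is a list of (start, end, tag) tuples with end exclusive. The spans
    are assumed to come from an external tagger that is not shown here.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for start, end, tag in spans:
        if tag in LINGUISTIC_TAGS:
            i = rng.randrange(start, end)  # pick a single token from the span
            labels[i] = tokens[i]
            masked[i] = MASK
    return masked, labels


tokens = ["Marie", "Curie", "discovered", "radium", "in", "Paris"]
spans = [(0, 2, "NE"), (2, 3, "VERB"), (3, 4, "NOUN"), (4, 5, "ADP"), (5, 6, "NE")]
masked, labels = lem_mask(tokens, spans, rng=random.Random(0))
print(masked)
```

In this sketch the two-token entity "Marie Curie" loses only one of its tokens, so the model can still use the other as context, while the function word "in" is never a masking target. Under `mlm_mask`, by contrast, any token may be masked regardless of its linguistic role.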