Linguistic Entity Masking to Improve Cross-Lingual Representation of Multilingual Language Models for Low-Resource Languages

📝 Paper Summary

Multilingual Language Models, Low-Resource NLP

The paper introduces Linguistic Entity Masking (LEM), a pre-training strategy that selectively masks single tokens from Named Entities, Nouns, and Verbs to improve cross-lingual representations for low-resource languages.

Core Problem

Standard Masked Language Modeling (MLM) and Translation Language Modeling (TLM) mask tokens randomly, ignoring linguistic importance, which leads to suboptimal cross-lingual representations for low-resource languages (LRLs).

Why it matters:

Current multilingual models (like XLM-R) often underperform on LRLs due to a lack of explicit cross-lingual alignment objectives beyond random masking.
Morphologically rich LRLs suffer from over-segmentation into many sub-words; masking long spans of these sub-words destroys context needed for learning.
Bitext mining and parallel data curation are critical for training Neural Machine Translation systems for LRLs but rely heavily on high-quality sentence embeddings.

Concrete Example: In the sentence 'Jack walks towards the road', 'Jack' (NE) and 'walks' (Verb) carry the most semantic weight. Standard MLM might randomly mask 'towards', which is less informative. LEM ensures 'Jack' or 'walks' is masked to force the model to focus on semantically dense tokens.

Key Novelty

Linguistic Entity Masking (LEM)

Targeted Masking: Instead of selecting tokens at random, LEM specifically targets Named Entities (NEs), Nouns, and Verbs for masking, on the premise that these words are the most prominent in a sentence and receive higher attention weights.
Single Token Constraint: Unlike span masking, LEM masks only a *single* token from a multi-token linguistic entity. This preserves more context, which is crucial for morphologically rich languages where words split into many sub-words.
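The single-token constraint can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names, the `lem_candidates` helper, and the toy 3-character tokenizer are all assumptions made for illustration, and each tagged word is treated as one entity. How these candidates are combined with the overall masking budget is not covered here.

```python
import random

# Tags treated as linguistic entities; these tag names are illustrative assumptions.
TARGET_TAGS = {"NE", "NOUN", "VERB"}

def lem_candidates(words, tags, subword_tokenize, rng):
    """Return (tokens, positions): one candidate sub-word index per tagged word."""
    tokens, positions = [], []
    for word, tag in zip(words, tags):
        pieces = subword_tokenize(word)
        if tag in TARGET_TAGS and pieces:
            # Single-token constraint: pick exactly one sub-word of this entity,
            # leaving its other sub-words visible as context.
            positions.append(len(tokens) + rng.randrange(len(pieces)))
        tokens.extend(pieces)
    return tokens, positions

def toy_tokenize(word):
    """Hypothetical stand-in for a sub-word tokenizer: chunks of 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]

words = ["Jack", "walks", "towards", "the", "road"]
tags = ["NE", "VERB", "ADP", "DET", "NOUN"]
tokens, positions = lem_candidates(words, tags, toy_tokenize, random.Random(0))
```

Here 'towards' and 'the' never become candidates, while 'Jack', 'walks', and 'road' each contribute exactly one sub-word position regardless of how many pieces they split into.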

Architecture

The two-stage continual pre-training process for XLM-R using LEM.

Evaluation Highlights

XLM-R continually pre-trained with LEM outperforms the MLM+TLM baseline on bitext mining recall across English-Sinhala, English-Tamil, and Sinhala-Tamil pairs.
Parallel data curated using LEM-based embeddings improves Neural Machine Translation (NMT) performance (measured by ChrF) compared to baselines.
LEM improves code-mixed sentiment analysis F1 scores for English-Sinhala compared to standard XLM-R baselines.

Breakthrough Assessment

5/10

A solid incremental improvement for low-resource languages. It refines existing masking strategies with linguistic intuition but doesn't propose a radical new architecture. Evaluation is limited to three language pairs.

⚙️ Technical Details

Problem Definition

Setting: Continual pre-training of multilingual models to improve cross-lingual sentence representations

Inputs: Monolingual corpora and parallel sentence pairs (bitexts)

Outputs: Contextualized token embeddings and sentence representations

Pipeline Flow

Tokenization & Linguistic Tagging (Identify NEs, Nouns, Verbs)
LEM_mono Continual Pre-training (Monolingual Data)
LEM_para Continual Pre-training (Parallel Data)
Downstream Task Inference (Bitext Mining / Sentiment Analysis)
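For the parallel-data stage, the two sentences of a pair are fed to the encoder as one sequence. The sketch below assumes the standard XLM-R sentence-pair layout (`<s> A </s></s> B </s>`); the helper names `build_parallel_input` and `shift_positions` are hypothetical, and the paper's exact input construction may differ.

```python
def build_parallel_input(src_tokens, tgt_tokens, bos="<s>", eos="</s>"):
    """Concatenate a source/target pair into one sequence: <s> A </s></s> B </s>."""
    return [bos, *src_tokens, eos, eos, *tgt_tokens, eos]

def shift_positions(src_positions, tgt_positions, src_len):
    """Map per-sentence candidate positions to indices in the concatenated sequence."""
    # Source tokens start at index 1 (after <s>).
    # Target tokens start after <s>, the source tokens, and two </s> separators.
    return [p + 1 for p in src_positions] + [p + src_len + 3 for p in tgt_positions]

src, tgt = ["a"], ["b", "c"]
sequence = build_parallel_input(src, tgt)
joint_positions = shift_positions([0], [1], len(src))
```

Because both sides share one sequence, a masked entity token on one side can be predicted with help from its translation on the other side.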

System Modules

Linguistic Tagger (Input Processing)

Identify linguistic entities (NEs, Nouns, Verbs) in input sentences

Model or implementation: Flair (English), TnT (Sinhala), ThamizhiUDp (Tamil)

Masking Module (Input Processing)

Apply LEM strategy: select single tokens from identified entities to mask

Model or implementation: Rule-based sampler

Encoder

Generate contextual embeddings from masked input

Model or implementation: XLM-R (Base)

Novel Architectural Elements

Modification of the masking pipeline to strictly sample single tokens from specific linguistic categories (NE, Noun, Verb) rather than random or span-based sampling.

Modeling

Base Model: XLM-R (Base)

Training Method: Continual Pre-training

Objective Functions:

Purpose: Predict masked tokens from monolingual data using linguistic entity prioritization.

Formally: Minimize Cross-Entropy Loss for LEM_mono.

Purpose: Predict masked tokens from concatenated parallel sentences using linguistic entity prioritization.

Formally: Minimize Cross-Entropy Loss for LEM_para.
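One way to write the two objectives is sketched below. The notation is assumed here rather than taken from the paper: $x$ is a monolingual token sequence, $z = [x^{\text{src}}; x^{\text{tgt}}]$ is a concatenated parallel pair, $M$ is the set of positions selected by LEM, a tilde marks the corrupted sequence, and $p_\theta$ is the model's predicted token distribution.

```latex
\mathcal{L}_{\text{LEM\_mono}}(\theta)
  = -\sum_{i \in M} \log p_\theta\left(x_i \mid \tilde{x}\right)
\qquad
\mathcal{L}_{\text{LEM\_para}}(\theta)
  = -\sum_{i \in M} \log p_\theta\left(z_i \mid \tilde{z}\right),
\quad z = [x^{\text{src}}; x^{\text{tgt}}]
```

Both are the usual masked-token cross-entropy; under this reading, the two differ only in the input sequence and in where the masked positions come from.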

Training Data:

Monolingual: 60k-500k sentences per language, drawn from MADLAD-400 and the SiTa-Trilingual dataset
Parallel: English-Sinhala, English-Tamil, Sinhala-Tamil pairs

Key Hyperparameters:

masking_rate: 15%
corruption_rule: 80% [MASK], 10% random, 10% original
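The two hyperparameters above can be sketched as follows. This is a generic illustration of the 80/10/10 rule, not the paper's code: `corrupt` and `masking_budget` are hypothetical helpers, the rounding in `masking_budget` is an assumption, and the default mask string follows XLM-R's tokenizer (`<mask>`).

```python
import random

def masking_budget(num_tokens, rate=0.15):
    """Number of positions to select at the given masking rate (at least one)."""
    return max(1, round(num_tokens * rate))

def corrupt(tokens, positions, vocab, rng, mask_token="<mask>"):
    """Apply the 80/10/10 rule at the selected positions; return (corrupted, labels)."""
    corrupted = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # the prediction target is always the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = mask_token          # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

tokens = ["a", "b", "c", "d"]
corrupted, labels = corrupt(tokens, [1, 3], ["x", "y"], random.Random(0))
```

In a LEM setting, `positions` would come from the linguistic-entity sampler rather than from uniform random selection; the corruption step itself is unchanged.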

Compute: Not reported in the paper

Comparison to Prior Work

vs. Entity/Phrase Masking: LEM masks only a *single* token within the entity span rather than the whole phrase, preserving more context.
vs. Span Masking: LEM targets linguistically informative words specifically, whereas Span Masking is random.
vs. Standard MLM/TLM: LEM prioritizes NEs, Nouns, and Verbs over function words or random tokens.

Limitations

Depends on the quality of external POS taggers and NER tools, which may be poor for very low-resource languages.
Evaluated only on three language pairs (English-Sinhala, English-Tamil, Sinhala-Tamil).
Requires linguistic preprocessing (tagging) which adds computational overhead compared to random masking.

Reproducibility

No code release is reported. The paper relies on specific linguistic taggers (TnT for Sinhala, ThamizhiUDp for Tamil), which are external dependencies.

📊 Experiments & Results

Evaluation Setup

Continual pre-training followed by downstream tasks on LRL pairs.

Benchmarks:

Bitext Mining (Sentence Retrieval)
Parallel Data Curation for NMT (Machine Translation)
Code-Mixed Sentiment Analysis (Classification)

Metrics:

Recall (for Bitext Mining)
ChrF (for NMT quality)
F1 Score (for Sentiment Analysis)
Statistical methodology: Not explicitly reported in the paper
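As a rough illustration of the bitext mining metric, recall can be computed as the fraction of source sentences whose retrieved target is the gold translation. The sketch below assumes nearest-neighbour retrieval with cosine similarity over sentence embeddings; the paper's actual scoring function is not stated in this summary, and the toy vectors are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieval_recall(src_embs, tgt_embs, gold):
    """Fraction of source sentences whose nearest target is the gold translation."""
    hits = 0
    for i, src in enumerate(src_embs):
        best = max(range(len(tgt_embs)), key=lambda j: cosine(src, tgt_embs[j]))
        hits += int(best == gold[i])
    return hits / len(src_embs)

# Toy 2-dimensional "embeddings"; gold[i] is the index of the true translation.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.2, 0.8], [1.0, -1.0]]
gold = [0, 1, 2]
recall = retrieval_recall(src, tgt, gold)
```

In this toy case the first two source vectors retrieve their gold targets and the third does not, so recall is 2/3.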

Key Results

Ablation studies on masking strategies show that LEM (single token masking) outperforms Span and Whole Word masking for bitext mining.

Benchmark	Metric	Baseline	This Paper	Δ
Bitext Mining (En-Si)	Recall	0.85	0.94	+0.09
Bitext Mining (En-Si)	Recall	0.90	0.94	+0.04

Downstream task performance: NMT models trained on data curated by LEM-based retrievers show improvements in translation quality (ChrF).

Benchmark	Metric	Baseline	This Paper	Δ
Flores+ devtest (En-Si)	ChrF	46.1	46.4	+0.3

Main Takeaways

Targeting linguistically prominent words (Named Entities, Verbs, Nouns) for masking is more effective than random masking for cross-lingual representation learning.
Masking a single token within a linguistic entity is superior to masking the entire entity span for morphologically rich low-resource languages, likely because it preserves more context.
Using 'dependent' monolingual data (source/target sides of parallel data) is more effective for continual pre-training than using 'independent' monolingual data.
The method generalizes to code-mixed sentiment analysis, suggesting improved alignment of embeddings even for mixed-language inputs.

📚 Prerequisite Knowledge

Prerequisites

Masked Language Modeling (MLM)
Translation Language Modeling (TLM)
Tokenization (sub-word units)
Cross-lingual sentence retrieval / Bitext mining

Key Terms

LEM: Linguistic Entity Masking—the proposed strategy of masking single tokens from Named Entities, Nouns, and Verbs.

MLM: Masked Language Modeling—a pre-training objective where random tokens in a sentence are masked and the model must predict them.

TLM: Translation Language Modeling—an extension of MLM using concatenated parallel sentences, allowing the model to attend to the translation context.

Bitext mining: The task of automatically finding parallel sentence pairs (translations) from two large monolingual corpora.

Code-mixed: Text that alternates between two or more languages within the same sentence or utterance.

ChrF: Character n-gram F-score—an automatic evaluation metric for machine translation that correlates well with human judgment, especially for morphologically rich languages.

LRL: Low-Resource Language—a language with limited available training data (text or parallel corpora).