Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment

📝 Paper Summary

Multilingual Representation Learning
Cross-lingual Transfer

The paper proposes aligning representations across different scripts in multilingual models by post-training with transliterated data using combined masked language modeling, sequence-level contrastive learning, and token-level alignment objectives.

Core Problem

Multilingual models struggle to transfer knowledge between languages written in different scripts (the 'script barrier') because their embeddings are disjoint, even when languages are linguistically related.

Why it matters:

Low-resource languages often perform poorly in transfer tasks solely due to script differences
Existing alignment methods require parallel data (dictionaries/translations), which is scarce for many low-resource languages
Token representations from different scripts can be linearly separated, indicating a lack of a common representation space

Concrete Example: A model trained on high-resource English (Latin script) may fail to transfer performance to Amharic (Ge'ez script) or Farsi (Arabic script) despite shared vocabulary or areal features, because the model treats the different scripts as unrelated symbols.

Key Novelty

Transliteration-based Post-Pretraining Alignment (PPA)

Uses rule-based transliteration (Uroman) to convert monolingual data into Latin script, creating pseudo-parallel data without needing translation dictionaries
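Combines three distinct objectives: standard Masked Language Modeling, Sequence-Level Contrastive Learning (aligning sentence embeddings of original and transliterated text), and Token-Level Alignment (aligning token representations through masked prediction over concatenated original and transliterated sequences)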
Applies alignment at both sequence and token levels simultaneously, addressing limitations of prior work that focused on only one or relied on English-centric transfer

Architecture

Conceptual illustration of the Transliteration-Based Post-Training Alignment method

Breakthrough Assessment

7/10

Proposed method effectively addresses the script barrier without expensive parallel data. The combination of sequence and token-level alignment via transliteration is a logical and potentially high-impact extension of prior work.

⚙️ Technical Details

Problem Definition

Setting: Post-training alignment of pre-trained multilingual encoders to improve zero-shot cross-lingual transfer across diverse scripts

Inputs: Monolingual corpora in original scripts

Outputs: Aligned multilingual encoder capable of better cross-script transfer

Pipeline Flow

Input Processing: Original Script Data → Transliteration (Uroman) → Pseudo-Parallel Pairs
Alignment Training: mPLM Encoder → MLM Head + Contrastive Head
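
The input-processing step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: `toy_romanize` and its character table are hypothetical stand-ins for Uroman and cover only a handful of Cyrillic letters, and `make_pseudo_parallel` is an illustrative helper name.

```python
# Toy stand-in for a romanizer. This is NOT Uroman: the table below is a
# hypothetical, hand-written mapping for a few Cyrillic letters only.
TOY_ROMANIZATION = {
    "п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",
    "м": "m", "д": "d", "о": "o", "б": "b", "а": "a",
}


def toy_romanize(text: str) -> str:
    """Map each known character to Latin; leave everything else unchanged."""
    return "".join(TOY_ROMANIZATION.get(ch, ch) for ch in text.lower())


def make_pseudo_parallel(corpus: list[str]) -> list[tuple[str, str]]:
    """Pair every monolingual sentence with its romanized view."""
    return [(sentence, toy_romanize(sentence)) for sentence in corpus]


pairs = make_pseudo_parallel(["привет мир", "добро"])
for original, romanized in pairs:
    print(original, "->", romanized)
```

No translation or dictionary is involved: each pair holds the same sentence in two scripts, which is what makes the data "pseudo-parallel".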

System Modules

Uroman Transliterator

Convert original script text to Latin script to generate pseudo-parallel views

Model or implementation: Rule-based system (Uroman)

Multilingual Encoder

Encode sequences into contextualized representations

Model or implementation: Glot500 (XLM-R based)

Contrastive Head (Alignment)

Align sentence-level representations of original and transliterated versions

Model or implementation: SimCSE-style projection/loss

MLM/TLM Head (Alignment)

Align token-level representations via masked prediction on individual and concatenated sequences

Model or implementation: Standard MLM head
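
A rough sketch of how a concatenated training example for this head might be built is shown below. Several details are assumptions rather than facts from the paper: the `[MASK]` and `[SEP]` strings are placeholders (the real tokenizer's special tokens may differ), the 15% masking rate is the common MLM convention, and masking is applied uniformly over both halves of the sequence.

```python
import random

MASK = "[MASK]"  # placeholder special token (assumption)
SEP = "[SEP]"    # placeholder separator between the two scripts (assumption)


def build_tlm_example(orig_tokens, latn_tokens, mask_prob=0.15, seed=0):
    """Concatenate both views and mask random tokens on either side.

    Returns (inputs, labels); labels hold the original token at masked
    positions and None everywhere else.
    """
    rng = random.Random(seed)
    tokens = list(orig_tokens) + [SEP] + list(latn_tokens)
    inputs, labels = [], []
    for tok in tokens:
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)   # the model must recover this token
            labels.append(tok)
        else:
            inputs.append(tok)    # visible context, not scored
            labels.append(None)
    return inputs, labels


inputs, labels = build_tlm_example(["привет", "мир"], ["privet", "mir"], mask_prob=0.5)
print(inputs)
print(labels)
```

Because a masked token in one script can be predicted from its visible counterpart in the other script, the encoder is encouraged to relate the two token spaces.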

Novel Architectural Elements

Integration of Translation Language Modeling (TLM), applied here to original/transliterated pairs rather than translations, with Sequence-Level Contrastive Learning specifically for script alignment
Use of layer 8 specifically for sequence-level contrastive alignment in a deep transformer

Modeling

Base Model: Glot500 (based on XLM-R)

Training Method: Multi-objective fine-tuning (MLM + SimCSE + TLM) on transliterated pairs

Objective Functions:

Masked Language Modeling (MLM)

Purpose: Preserve original knowledge and learn Latin representations.

Formally: Standard MLM loss on both original (L_MLM_orig) and transliterated (L_MLM_latn) sequences.

Sequence-Level Contrastive Learning

Purpose: Align sentence representations across scripts.

Formally: Contrastive loss L_con = -log(exp(sim(f(X), f(X+))/τ) / Σ exp(sim(f(X), f(X-))/τ)) where (X, X+) are original/transliterated pairs.

Token-Level Alignment (TLM)

Purpose: Align token representations across scripts.

Formally: L_TLM applied to concatenated sequences X_orig ⊕ X_latn (masking tokens in one script and predicting using context from both).
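
To make the sequence-level objective concrete, here is a minimal pure-Python sketch of an in-batch contrastive loss. It is not the authors' implementation. One assumption to note: the denominator here runs over every transliterated sentence in the batch, the positive included, which is the common SimCSE-style convention and may differ from the notation as summarized. The function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(orig_embs, latn_embs, tau=1.0):
    """Mean in-batch contrastive loss.

    orig_embs[i] and latn_embs[i] embed the same sentence in its original
    and transliterated form; all other latn_embs act as negatives.
    """
    losses = []
    for i, anchor in enumerate(orig_embs):
        logits = [cosine(anchor, cand) / tau for cand in latn_embs]
        # Numerically stable log-sum-exp over all candidates (assumption:
        # the positive is included in the denominator).
        max_logit = max(logits)
        log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is lower when each original sentence is closest to its own transliteration than when the pairs are mismatched, which is the behavior the objective is meant to reward.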

Key Hyperparameters:

temperature_tau: 1
pooling_layer: 8th layer
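
The layer choice can be illustrated with a small pooling sketch. The summary only states that layer 8 is used; mean pooling over non-padding tokens is an assumption here, as is the indexing convention (index 0 for the embedding output, so index 8 is the output of the 8th transformer layer, as in tuples returned by Hugging Face models with `output_hidden_states=True`). `mean_pool_layer` is a hypothetical helper name.

```python
def mean_pool_layer(hidden_states, attention_mask, layer=8):
    """Average the token vectors of one layer, skipping padded positions.

    hidden_states: list indexed as [layer][token][dim] for one sentence.
    attention_mask: list of 0/1 flags, one per token.
    """
    token_vectors = hidden_states[layer]
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Fake hidden states: 13 entries (embeddings + 12 layers), 3 tokens, 2 dims.
fake_states = [[[float(l), float(t)] for t in range(3)] for l in range(13)]
sentence_vec = mean_pool_layer(fake_states, attention_mask=[1, 1, 0])
print(sentence_vec)
```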

Compute: Not reported in the paper

Comparison to Prior Work

vs. TransliCo: Adds token-level alignment (TLM) and explores non-English source languages for better lexical overlap
vs. Muller et al. (2021): Does not force a single common script; aligns original and transliterated spaces allowing flexibility
vs. Pan et al. (2021): Uses transliteration instead of translation-based parallel data (related methodology; not used as a direct baseline in the paper)

Limitations

Relies on the quality of the rule-based Uroman transliteration tool
Transliteration can be lossy and non-invertible (though the method also retains and aligns the original-script text, which mitigates this)
Effectiveness depends on the existence of shared linguistic properties beyond just script differences

Reproducibility

Code: https://github.com/cisnlp/Transliteration-PPA

Code and models are publicly available at the repository above. The method relies on the Uroman tool for transliteration.

📊 Experiments & Results

Evaluation Setup

Zero-shot cross-lingual transfer from source languages to target languages within specific areal groups

Benchmarks:

Sentence Retrieval Tasks (Sentence-level retrieval)
Text Classification Tasks (Classification)
Sequence Labeling Tasks (token-level labeling, e.g., NER, POS)
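
As an illustration of how a sentence retrieval task is commonly scored, the sketch below computes top-1 accuracy by nearest-neighbor cosine similarity. This is an assumption about a typical setup, not the paper's stated metric or evaluation code; `top1_retrieval_accuracy` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_retrieval_accuracy(query_embs, target_embs):
    """Fraction of queries whose most similar target shares their index.

    query_embs[i] and target_embs[i] are assumed to embed parallel sentences.
    """
    correct = 0
    for i, query in enumerate(query_embs):
        scores = [cosine(query, target) for target in target_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        correct += int(best == i)
    return correct / len(query_embs)


queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]]
accuracy = top1_retrieval_accuracy(queries, targets)
print(accuracy)
```

Better cross-script alignment should move each sentence closer to its counterpart in the other language, which shows up directly as higher accuracy under this kind of scoring.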

Metrics:

Not reported in the paper
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

Method consistently improves downstream task performance across different languages and scripts compared to original mPLM baselines
Abstract reports improvements of 'up to 50%' for some tasks in English-centric transfer
Using related non-English source languages (e.g., within Mediterranean or Asian groups) yields larger improvements than English-only transfer, validating the importance of lexical overlap
Combining sequence-level and token-level alignment objectives is crucial for addressing different types of downstream tasks (retrieval vs. labeling)

📚 Prerequisite Knowledge

Prerequisites

Multilingual Pre-trained Language Models (mPLMs) like XLM-R
Masked Language Modeling (MLM)
Contrastive Learning (SimCSE)
Transliteration concepts

Key Terms

mPLM: Multilingual Pre-trained Language Model—a model trained on text in many languages to learn shared representations

Script barrier: The performance gap caused by languages using different writing systems (scripts), preventing effective knowledge transfer

Transliteration: Converting text from one script to another (e.g., Cyrillic to Latin) based on phonetic similarity, without translating the meaning

Uroman: A universal romanizer tool that converts text from almost any script into Latin characters

TLM: Translation Language Modeling—an objective usually applied to parallel translation pairs, adapted here for transliteration pairs to align tokens

SimCSE: Simple Contrastive Sentence Embeddings—a framework for learning sentence vectors by pulling similar sentences together and pushing others apart

Glot500: A massively multilingual model pre-trained on over 500 languages, used here as the base model