Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment

📝 Paper Summary

Multilingual Large Language Models · Cross-lingual In-context Learning

AFP aligns multilingual generative models by pulling internal representations of translation pairs closer and enforcing output alignment via cross-lingual instruction following to improve cross-lingual transfer.

Core Problem

Multilingual generative models exhibit a performance bias toward high-resource languages and maintain isolated internal representation distributions for different languages, hindering effective knowledge transfer.

Why it matters:

High-resource bias leaves low-resource languages with significantly worse performance (e.g., 27.5% gap between English and Telugu in GPT-4)
Isolated representation clusters mean models learn languages separately rather than transferring concepts, limiting the utility of multilingual pre-training
Re-training massive models to fix these biases is computationally prohibitive; lightweight alignment methods are needed

Concrete Example: When visualizing sentence representations in XGLM, English and Chinese sentences with the same meaning form two distinct, separated clusters. This separation prevents the model from effectively using English knowledge to answer Chinese prompts in zero-shot settings.

Key Novelty

Align aFter Pre-training (AFP)

Aligns internal model states by treating translation pairs as positive examples in contrastive learning, pulling their vector representations together within the decoder
Aligns model outputs by forcing the model to generate responses in a target language given a source language context (Cross-lingual Instruction Following), rather than just matching the input language

Architecture

The AFP framework illustrating the two alignment modules: Multilingual Contrastive Learning (MCL) and Cross-lingual Instruction Following (CIF).

Evaluation Highlights

Reduces the relative performance gap between English and Chinese on XNLI by 6.53% for XGLM 564M using <1M parallel samples
Improves average performance on 5 multilingual tasks (NLI, Reasoning, Paraphrase) by 2.6% across 52 languages
Boosts zero-shot translation performance from 27.3 to 61.2 COMET score on average for XGLM models

Breakthrough Assessment

7/10

Effective lightweight framework that significantly boosts cross-lingual transfer with minimal data (<0.1‰ of pre-training tokens). While not a new architecture, it solves a critical alignment problem efficiently.

⚙️ Technical Details

Problem Definition

Setting: Aligning a pre-trained multilingual generative model f(θ) using a small set of parallel sentence pairs and instructions

Inputs: Parallel sentence pairs (source, target) and instruction-response pairs

Outputs: An aligned model f'(θ) with improved cross-lingual in-context learning capabilities

Pipeline Flow

Input Translation Pairs → Multilingual Contrastive Learning (MCL) on Decoder Layer
Input Instruction Pairs → Cross-lingual Instruction Following (CIF) on Output Generation

System Modules

Multilingual Contrastive Learning (MCL)

Aligns internal sentence representations across languages

Model or implementation: Applies to the first transformer layer after embedding (determined empirically)

Cross-lingual Instruction Following (CIF)

Aligns outputs by requiring response generation in a specific target language

Model or implementation: Full decoder generation
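A minimal sketch of how a CIF training example might be assembled, assuming each instruction and response is available in several languages (as in a translated instruction dataset). The function names, the `Respond in:` prompt template, and the dictionary layout are hypothetical and not taken from the paper; only the idea of sampling a target language that may differ from the source language comes from the summary.

```python
import random

def pick_target_language(source_lang, languages, p_src=0.5, rng=random):
    """Keep the source language with probability p_src, else pick another one."""
    others = [lang for lang in languages if lang != source_lang]
    if not others or rng.random() < p_src:
        return source_lang
    return rng.choice(others)

def build_cif_example(instruction_by_lang, response_by_lang, source_lang,
                      p_src=0.5, rng=random):
    """Pair a source-language instruction with a response in the sampled target language."""
    target_lang = pick_target_language(
        source_lang, sorted(response_by_lang), p_src, rng
    )
    # Hypothetical template: the real prompt format is not given in the summary.
    prompt = f"{instruction_by_lang[source_lang]}\nRespond in: {target_lang}\n"
    return {
        "prompt": prompt,
        "response": response_by_lang[target_lang],
        "target_lang": target_lang,
    }

if __name__ == "__main__":
    instructions = {"en": "Name a fruit.", "zh": "说出一种水果。"}
    responses = {"en": "Apple.", "zh": "苹果。"}
    print(build_cif_example(instructions, responses, "en", rng=random.Random(0)))
```

Setting `p_src=1.0` in this sketch reduces it to same-language instruction tuning, which is the contrast the summary draws with MIT.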

Novel Architectural Elements

Application of contrastive learning specifically to the internal layers of a decoder-only generative model for cross-lingual alignment (MCL)
Cross-lingual instruction following (CIF) objective where the model must switch languages between context and response, enforcing semantic transfer

Modeling

Base Model: XGLM (564M, 7.5B), BLOOM (560M, 1.7B, 7.1B), Llama-7B

Training Method: Joint optimization of Contrastive Learning and Causal Language Modeling

Objective Functions:

Purpose: Pull representations of translation pairs closer.

Formally: $\mathcal{L}_{MCL} = -\log \frac{\exp(\mathrm{sim}(h_i, h_i^{+})/\tau)}{\sum_j \exp(\mathrm{sim}(h_i, h_j)/\tau)}$

Purpose: Train model to follow instructions and switch languages.

Formally: $\mathcal{L}_{CIF} = -\sum_j \log P(r_b^{j} \mid c_{a \to b}, r_b^{<j})$

Purpose: Combine objectives.

Formally: $\mathcal{L}_{AFP} = \mathcal{L}_{MCL} + \alpha \, \mathcal{L}_{CIF}$
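The contrastive objective and the weighted sum above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: `sim` is taken to be cosine similarity, sentence vectors come from mean pooling over token vectors, and the negatives in the denominator are the other target sentences in the same batch. All function names are hypothetical.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors into a single sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[d] for tok in token_vectors) / n for d in range(dim)]

def cosine(u, v):
    """Cosine similarity (assumed choice for sim)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mcl_loss(sources, targets, tau=0.05):
    """InfoNCE over a batch: targets[i] is the positive for sources[i];
    all other targets in the batch act as negatives (assumption)."""
    total = 0.0
    for i, h in enumerate(sources):
        logits = [cosine(h, t) / tau for t in targets]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(sources)

def afp_loss(l_mcl, l_cif, alpha=1.5):
    """Weighted sum of the two objectives."""
    return l_mcl + alpha * l_cif

if __name__ == "__main__":
    english = [[1.0, 0.0], [0.0, 1.0]]
    chinese = [[0.9, 0.1], [0.1, 0.9]]  # toy vectors for translations
    print(mcl_loss(english, chinese))
```

In this toy batch the loss is small when each sentence is closest to its own translation and grows when the pairing is swapped, which is the behavior the objective is meant to reward.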

Training Data:

Bactrian-X (multilingual instruction tuning dataset, 52 languages)
OPUS-100 (multilingual machine translation dataset, 100k samples used)

Key Hyperparameters:

learning_rate: 1e-5
optimizer: AdamW (β1=0.9, β2=0.999)
temperature_tau: 0.05
alpha: 1.5 (balance between MCL and CIF)
p_src: 0.5 (probability that the target language is the same as the source language)
batch_size: 128
training_steps: 10k
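For reference, the listed values can be gathered into a single configuration object. The class name, field names, and validation checks below are hypothetical conveniences; only the numeric values come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AFPTrainingConfig:
    """Hypothetical container for the hyperparameters reported in the summary."""
    learning_rate: float = 1e-5
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    temperature_tau: float = 0.05
    alpha: float = 1.5          # weight of the CIF term in the joint loss
    p_src: float = 0.5          # chance the target language equals the source
    batch_size: int = 128
    training_steps: int = 10_000

    def __post_init__(self):
        # Basic sanity checks (an addition of this sketch, not from the paper).
        if not 0.0 <= self.p_src <= 1.0:
            raise ValueError("p_src must be a probability in [0, 1]")
        if self.temperature_tau <= 0.0:
            raise ValueError("temperature_tau must be positive")

if __name__ == "__main__":
    print(AFPTrainingConfig())
```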

Compute: 8x A100 80GB GPUs

Comparison to Prior Work

vs. MIT: AFP forces cross-lingual output generation (CIF) and explicitly aligns internal representations (MCL), whereas MIT only does same-language generation
vs. BLOOMZ: AFP achieves better performance with significantly less data (<1M samples vs 78M instructions for BLOOMZ)
vs. Semantic Alignment: AFP aligns the model weights themselves, whereas Semantic Alignment only aligns the prompt context (not used as a direct baseline in the paper; the two methods are complementary)

Limitations

Relies on labeled parallel training data, which is unavailable for extremely low-resource languages
Constrained to models ≤ 7.5B parameters due to compute resources
Potential error propagation from machine translation systems used to create the instruction dataset (Bactrian-X)
Risk of inheriting cultural biases from English due to English-centric pivot alignment

Reproducibility

Code: https://github.com/chongli17/CrossLingualAlignment

The code and method are publicly released. The training datasets (Bactrian-X, OPUS-100) are public.

📊 Experiments & Results

Evaluation Setup

Zero-shot and few-shot (0/5-shot) in-context learning across multilingual benchmarks

Benchmarks:

XNLI (Natural Language Inference)
PAWS-X (Paraphrase Detection)
XCOPA (Causal Reasoning)
XStoryCloze (Commonsense Reasoning)
FLORES-101 (Machine Translation)

Metrics:

Accuracy
COMET (for translation)
spBLEU (for translation)
Alignment and Uniformity scores (for representation analysis)
Statistical methodology: Not explicitly reported in the paper

Key Results

Bilingual alignment (English-Chinese) results showing AFP improves over base models and standard instruction tuning.
Benchmark	Metric	Baseline	This Paper	Δ
XNLI (Average EN/ZH)	Accuracy	41.55	47.30	+5.75
XCOPA (Average EN/ZH)	Accuracy	54.35	57.70	+3.35
XNLI + XCOPA (Average)	Accuracy	46.1	50.7	+4.6
FLORES-101	COMET	26.0	59.3	+33.3
5 Datasets Avg	0-shot Accuracy	52.94	55.97	+3.03
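The Δ column is the score reported for this paper minus the baseline score. A small sketch that recomputes it from the table rows follows; the relative gain is an extra derived quantity added here for illustration and is not reported in the summary.

```python
# (benchmark, metric, baseline, this paper, reported delta), copied from the table.
rows = [
    ("XNLI (Average EN/ZH)", "Accuracy", 41.55, 47.30, 5.75),
    ("XCOPA (Average EN/ZH)", "Accuracy", 54.35, 57.70, 3.35),
    ("XNLI + XCOPA (Average)", "Accuracy", 46.1, 50.7, 4.6),
    ("FLORES-101", "COMET", 26.0, 59.3, 33.3),
    ("5 Datasets Avg", "0-shot Accuracy", 52.94, 55.97, 3.03),
]

def absolute_gain(baseline, ours):
    """Difference in points, rounded to the table's precision."""
    return round(ours - baseline, 2)

def relative_gain_pct(baseline, ours):
    """Gain as a percentage of the baseline score."""
    return 100.0 * (ours - baseline) / baseline

if __name__ == "__main__":
    for name, metric, baseline, ours, reported in rows:
        gain = absolute_gain(baseline, ours)
        status = "matches" if abs(gain - reported) < 1e-9 else "differs from"
        print(f"{name}: +{gain} {metric} ({status} reported), "
              f"relative gain {relative_gain_pct(baseline, ours):.1f}%")
```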

Experiment Figures

t-SNE visualization of internal sentence representations for English and Chinese before and after alignment.

Evolution of Alignment and Uniformity metrics during training for different methods.

Main Takeaways

AFP significantly boosts cross-lingual ability using <1M parallel samples, outperforming standard multilingual instruction tuning.
Internal representation analysis (Alignment and Uniformity metrics) confirms AFP creates better-aligned multilingual distributions compared to vanilla models or instruction tuning.
Applying Multilingual Contrastive Learning (MCL) to the *first* transformer layer yields the best results, contrary to encoder-based approaches that often use later layers.
Cross-lingual Instruction Following (CIF) is more effective than Multilingual Instruction Tuning (MIT) because it forces the model to bridge languages during generation.
The method generalizes to languages unseen during pre-training (e.g., Llama on Chinese, BLOOM on Thai/Turkish).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer decoder architectures
Familiarity with Contrastive Learning objectives
Basic knowledge of In-context Learning (ICL) and Instruction Tuning

Key Terms

AFP: Align aFter Pre-training, the proposed framework combining contrastive learning and cross-lingual instruction following

MCL: Multilingual Contrastive Learning—aligning internal representations by pulling translation pairs closer in vector space

CIF: Cross-lingual Instruction Following—training the model to generate responses in a target language different from the prompt's source language

MIT: Multilingual Instruction Tuning—standard tuning where prompt and response are in the same language

COMET: A neural framework for evaluating machine translation quality

alignment: The degree to which representations of semantically similar pairs are close in vector space

uniformity: The degree to which representations are uniformly distributed on the hypersphere

mean pooling: A method to aggregate token representations into a single sentence representation by averaging them
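The alignment and uniformity terms above can be made concrete with a short sketch. It assumes the commonly used definitions from the contrastive representation learning literature (mean squared distance between normalized positive pairs, and the log of the mean Gaussian potential over all pairs with t = 2); the summary does not state which exact formulas the paper uses, so treat the constants and function names as assumptions.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(sources, targets):
    """Mean squared distance between normalized positive pairs.
    Lower values mean translation pairs sit closer together."""
    distances = [
        squared_distance(l2_normalize(s), l2_normalize(t))
        for s, t in zip(sources, targets)
    ]
    return sum(distances) / len(distances)

def uniformity(vectors, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.
    Lower values mean the vectors are spread more evenly on the sphere."""
    unit = [l2_normalize(v) for v in vectors]
    potentials = [
        math.exp(-t * squared_distance(unit[i], unit[j]))
        for i in range(len(unit))
        for j in range(i + 1, len(unit))
    ]
    return math.log(sum(potentials) / len(potentials))

if __name__ == "__main__":
    english = [[1.0, 0.0], [0.0, 1.0]]
    chinese = [[0.9, 0.1], [0.2, 0.8]]  # toy sentence vectors
    print(alignment(english, chinese), uniformity(english + chinese))
```

Under these definitions, two separated language clusters would show poor alignment between translation pairs even if each cluster is internally well spread.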