Evaluating Cross-Lingual Unlearning in Multilingual Language Models

📝 Paper Summary

Knowledge Unlearning, Multilingual LLMs

Most unlearning methods fail to remove facts across languages, but subspace-projection succeeds by targeting a shared interlingua structure within the model's weight space.

Core Problem

Standard unlearning methods developed for monolingual settings fail to consistently remove factual knowledge across different languages in multilingual models, often leaving residual knowledge in non-target languages or causing collateral damage.

Why it matters:

Multilingual models share semantic subspaces, meaning removing a fact in one language doesn't guarantee its removal in others, risking safety and compliance violations (e.g., GDPR)
Current methods rely on surface-level loss signals that don't generalize to the shared geometric structure of multilingual representations
Incomplete unlearning creates security vulnerabilities where 'forgotten' information can be recovered by querying the model in a different language

Concrete Example: When a fact about a fictional author is unlearned in English using Gradient Ascent, the model may successfully forget it in English but still output the original fact when queried in Spanish or Hindi.

Key Novelty

Subspace-Projection for Cross-Lingual Unlearning

Identifies that multilingual models store facts in a shared 'interlingua' subspace (language-independent) and language-specific subspaces
Uses subspace projection to explicitly remove the directions in weight space corresponding to the shared interlingua, ensuring the fact is inaccessible across all languages
Operates directly on weight geometry rather than just optimizing loss on specific tokens, preventing the 'overfitting' to a single language seen in other methods
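
The projection step described above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how UNLEARN identifies the subspace, so the sketch assumes a hypothetical matrix `delta_w` whose top singular directions stand in for the fact-bearing subspace. The names `top_k_basis` and `project_out` are illustrative only.

```python
import numpy as np

def top_k_basis(delta_w: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis (d_out x k) spanning the top-k left singular directions."""
    u, _, _ = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :k]

def project_out(weight: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the component of `weight` inside span(basis): W' = (I - U U^T) W."""
    return weight - basis @ (basis.T @ weight)

rng = np.random.default_rng(0)
d_out, d_in, k = 8, 6, 2
weight = rng.normal(size=(d_out, d_in))

# Hypothetical stand-in for the rank-k weight change attributed to the forget facts.
delta_w = rng.normal(size=(d_out, k)) @ rng.normal(size=(k, d_in))

basis = top_k_basis(delta_w, k)
new_weight = project_out(weight, basis)

# After projection, the weights have no component left along the removed directions.
residual = np.abs(basis.T @ new_weight).max()
```

Because the projection is orthogonal, applying it a second time changes nothing, and directions outside the removed subspace are left untouched. That property is what lets a geometric edit of this kind avoid the collateral damage that loss-based updates can cause.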

Architecture

t-SNE visualization of task-specific subspaces across languages.

Evaluation Highlights

Subspace-projection (UNLEARN) is the only method to pass the forget-quality test (p-value > 0.1) across all language pairs while maintaining model utility near 1.0
Joint many-to-one unlearning (using multiple languages to unlearn) improves forget quality on held-out languages compared to one-to-one settings
Transliteration experiments show that English unlearning transfers better to romanized Hindi than Devanagari Hindi, confirming script dependence in knowledge storage

Breakthrough Assessment

8/10

First comprehensive evaluation of cross-lingual unlearning revealing fundamental failures in existing methods. Identifies and successfully manipulates the 'interlingua' subspace for robust multilingual forgetting.

⚙️ Technical Details

Problem Definition

Setting: Multilingual Knowledge Unlearning: Given a model M and a fact f in language L_src, update M to M' such that f is forgotten in L_src and all other languages L_target, while preserving other knowledge.

Inputs: A forget set of factual statements in a source language (e.g., English) and a retain set of unrelated facts.

Outputs: An unlearned model M' that refuses to answer the forget query across multiple languages.

Pipeline Flow

Dataset Translation (TOFU to 5 languages)
Unlearning Execution (Apply method on source language)
Cross-Lingual Evaluation (Test on all target languages)
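
Steps 2 and 3 of this pipeline amount to filling a source-by-target grid, which might be sketched as follows. The callables `unlearn` and `forget_quality` are hypothetical placeholders for whatever unlearning method and evaluator are used, and the toy stand-ins at the bottom exist only to make the sketch runnable. They are not results from the paper.

```python
from typing import Callable, Dict, Sequence

LANGS = ("en", "zh", "hi", "it", "es")  # the five languages listed in this summary

def cross_lingual_grid(
    unlearn: Callable[[str], object],
    forget_quality: Callable[[object, str], float],
    langs: Sequence[str] = LANGS,
    threshold: float = 0.1,
) -> Dict[str, Dict[str, bool]]:
    """Unlearn once per source language, then test every target language."""
    grid: Dict[str, Dict[str, bool]] = {}
    for src in langs:
        model = unlearn(src)  # apply the method on the source language only
        grid[src] = {
            tgt: forget_quality(model, tgt) > threshold  # p-value above threshold = forgotten
            for tgt in langs
        }
    return grid

# Toy stand-ins (hypothetical): a "model" is just the set of languages it forgot in.
def toy_unlearn(src: str) -> set:
    return {src}

def toy_forget_quality(model: set, tgt: str) -> float:
    return 0.5 if tgt in model else 0.01

grid = cross_lingual_grid(toy_unlearn, toy_forget_quality)
```

The toy method forgets only in its source language, so only the diagonal of `grid` is `True`. This mirrors the failure mode in the concrete example above, where a fact unlearned in English remains retrievable in Spanish or Hindi.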

System Modules

Data Preparation

Generate multilingual variants of TOFU facts

Model or implementation: N/A

Unlearning Algorithm

Update model weights to remove specific facts

Model or implementation: Various (Llama 3, Mixtral, Aya-23)

Novel Architectural Elements

Application of subspace discrimination to isolate 'interlingua' vs 'language-specific' weight components for targeted removal

Modeling

Base Model: Llama 3, Mixtral, Aya-23, Llama 2, Mistral, Llama 4, Magistral

Training Method: Various Unlearning Algorithms (Gradient Ascent, NPO, FLAT, UNLEARN)

Objective Functions:

Purpose: Maximize loss on forget data.

Formally: Gradient Ascent maximizes L(forget).

Purpose: Minimize divergence from original model on retain data.

Formally: KL divergence penalty between M and M' on retain set.

Purpose: Discourage generation of forget sequences via preference optimization.

Formally: NPO optimizes negative preference loss.

Purpose: Remove task-specific directions in weight space.

Formally: UNLEARN projects weights orthogonal to the identified task subspace.
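
The three loss-based objectives above can be written out on per-sequence log-probabilities as a rough sketch. The function names, the direction of the KL term, and the default `beta` are assumptions, since the summary does not specify them. The NPO form follows the commonly published definition, (2/β)·E[log(1 + (π_θ/π_ref)^β)].

```python
import math
from typing import Sequence

def gradient_ascent_loss(forget_logprobs: Sequence[float]) -> float:
    """Mean log-probability of forget sequences; minimizing it maximizes their NLL."""
    return sum(forget_logprobs) / len(forget_logprobs)

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def npo_loss(
    logp_theta: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> float:
    """NPO loss on forget sequences: (2/beta) * mean(log(1 + (pi_theta/pi_ref)^beta))."""
    terms = [
        math.log1p(math.exp(beta * (lt - lr)))
        for lt, lr in zip(logp_theta, logp_ref)
    ]
    return (2.0 / beta) * sum(terms) / len(terms)

# Placeholder log-probabilities, for illustration only.
ga = gradient_ascent_loss([-2.0, -4.0])
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_diff = kl_divergence([0.5, 0.5], [0.9, 0.1])
npo_before = npo_loss([-1.0], [-1.0])  # current model matches the reference
npo_after = npo_loss([-5.0], [-1.0])   # current model assigns less probability
```

Unlike the gradient-ascent objective, which can be pushed down without limit, the NPO term is bounded below by zero and flattens as the forget sequence becomes unlikely under the current model.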

Training Data:

Multilingual TOFU dataset
10% forget set size
Translations in English, Chinese, Hindi, Italian, Spanish
Romanized/Transliterated versions for Hindi and Chinese

Key Hyperparameters:

forget_set_size: 10%
statistical_significance_threshold: 0.1

Compute: Not reported in the paper

Comparison to Prior Work

vs. Gradient Ascent: UNLEARN targets weight geometry rather than surface loss, preventing overfitting to source language
vs. NPO: UNLEARN avoids the substantial utility degradation seen in NPO as forget set size increases
vs. Translation-based pipelines [not cited in paper]: UNLEARN modifies internal weights directly rather than relying on translating queries/answers to a pivot language for filtering

Limitations

Non-Latin script languages (Chinese, Hindi) show consistently lower forget quality and utility than Latin scripts
Unlearning is asymmetric: English-to-X is more effective than X-to-English, likely due to English dominance in pretraining
Performance depends heavily on the underlying model's multilingual capabilities (monolingual models fail completely)
Transliteration results indicate that script-specific subspaces still retain some knowledge, preventing perfect interlingua-based unlearning

Reproducibility

The text does not state whether code is available. The paper uses the public TOFU benchmark, but the translated datasets and model checkpoints are not linked.

📊 Experiments & Results

Evaluation Setup

Unlearning synthetic facts (TOFU) in one or more languages and evaluating retention in others.

Benchmarks:

Multilingual TOFU (Fictional Knowledge Unlearning) [New]

Metrics:

Forget Quality (p-value > 0.1 indicates success)
Model Utility (harmonic mean of capability metrics, relative to base model)
Statistical methodology: Two-sample p-value comparing unlearned model outputs to original model outputs (TOFU criterion).
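
As a small sketch of how these two metrics might be turned into numbers: the summary describes Model Utility only as a harmonic mean "relative to base model", so taking the harmonic mean of per-metric ratios is one possible reading, not a confirmed formula. The metric names and scores below are hypothetical placeholders.

```python
from statistics import harmonic_mean
from typing import Dict

def model_utility(unlearned: Dict[str, float], base: Dict[str, float]) -> float:
    """Harmonic mean of per-metric ratios (unlearned / base); 1.0 means no degradation."""
    ratios = [unlearned[name] / base[name] for name in base]
    return harmonic_mean(ratios)

def passes_forget_quality(p_value: float, threshold: float = 0.1) -> bool:
    """Success criterion stated in this summary: p-value above the 0.1 threshold."""
    return p_value > threshold

# Hypothetical capability scores, for illustration only.
base_scores = {"rouge": 0.8, "answer_prob": 0.5}
damaged_scores = {"rouge": 0.8, "answer_prob": 0.25}

utility_unchanged = model_utility(base_scores, base_scores)
utility_damaged = model_utility(damaged_scores, base_scores)
```

The harmonic mean is dominated by the weakest ratio, so a method cannot hide a collapse on one capability behind good scores on the others.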

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Monolingual unlearning baselines (English-to-English) show that subspace methods outperform classical optimization approaches.
TOFU (English)	Forget Quality (p-value)	< 0.05	> 0.1	Significant improvement
Cross-lingual experiments reveal that only subspace projection maintains utility while achieving forgetting across languages.
Multilingual TOFU	Forget Quality	Fail (p < 0.1) or High Collateral Damage	Success (p > 0.1)	Significant improvement
Transliteration experiments demonstrate that script acts as a barrier to complete cross-lingual unlearning.
TOFU (Hindi/Chinese)	Forget Quality	Lower	Higher	Positive

Experiment Figures

Heatmap or matrix of UNLEARN performance across all language pairs.

Many-to-One unlearning results.

Main Takeaways

Most existing unlearning methods (Gradient Ascent, NPO) fail to generalize across languages, often destroying model utility or failing to forget outside the source language.
A shared 'interlingua' subspace exists in multilingual models; targeting this subspace via projection allows for effective cross-lingual forgetting.
Cross-lingual unlearning is asymmetric: forgetting in the dominant language (English) transfers better to others than vice-versa.
Script and tokenization create language-specific subspaces that resist unlearning if not explicitly targeted (e.g., via transliteration or many-to-one training).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model fine-tuning and unlearning
Familiarity with multilingual representation learning (interlingua)
Linear algebra concepts (subspaces, projections)

Key Terms

Interlingua: A shared, language-independent semantic subspace within a multilingual model's weights where concepts align across languages

Subspace-projection: An unlearning technique that identifies the low-dimensional geometric direction of a specific task or fact in weight space and removes it via orthogonal projection

TOFU: Task of Fictional Unlearning—a benchmark dataset using synthetic authors and facts to test whether models can unlearn specific information without collateral damage

Gradient Ascent: A basic unlearning method that updates model weights to maximize the loss on the specific data to be forgotten, lowering the likelihood the model assigns to that data

NPO: Negative Preference Optimization—an unlearning method that treats the forget data as a 'rejected' sample in a preference optimization framework to discourage its generation

Collateral degradation: Unintended damage to the model's general capabilities or unrelated knowledge during the unlearning process

SISA: Sharded, Isolated, Sliced, and Aggregated—an unlearning framework that retrains models on subsets of data to make forgetting easier (mentioned as a baseline type)

KL divergence: A statistical measure of how one probability distribution differs from a second, reference probability distribution