
The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?

C Shani, Y Reif, N Roll, D Jurafsky, E Shutova
Stanford University, The Hebrew University of Jerusalem, University of Amsterdam
arXiv, January 2026
Pretraining Benchmark

📝 Paper Summary

Multilingual NLP · Model evaluation and analysis
Performance gaps in multilingual language models stem primarily from engineering choices like tokenization and data allocation rather than intrinsic linguistic difficulty, and these gaps shrink when design artifacts are normalized.
Core Problem
Multilingual language models exhibit systematic performance disparities where high-resource and Latin-script languages consistently outperform low-resource and typologically distant ones.
Why it matters:
  • Current disparities limit the practical utility of AI for billions of speakers of non-dominant languages.
  • Scaling alone does not resolve these inequities; larger models often preserve or amplify gaps rooted in tokenization and data sampling.
  • Misinterpreting engineering artifacts (like tokenizer fragmentation) as intrinsic linguistic difficulty prevents the development of truly equitable multilingual systems.
Concrete Example: Because of UTF-8 byte premiums, a Chinese character typically requires 3 bytes while a basic Latin character requires 1. Under a fixed byte budget, a model trained on Chinese therefore sees far less semantic content than one trained on English, leading to unfair comparisons and poorer performance, not because Chinese is 'harder' to model, but because the encoding is less efficient.
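The byte premium is easy to verify directly; a minimal sketch (the sample strings are illustrative, not drawn from the paper):

```python
# Sketch: UTF-8 "byte premium" — equivalent semantic content costs more
# bytes in some scripts than in others.
samples = {
    "English": "language model",
    "Chinese": "语言模型",  # 4 characters with the same meaning
}

for lang, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang}: {n_chars} chars -> {n_bytes} UTF-8 bytes "
          f"({n_bytes / n_chars:.1f} bytes/char)")
# English: 14 chars -> 14 UTF-8 bytes (1.0 bytes/char)
# Chinese: 4 chars -> 12 UTF-8 bytes (3.0 bytes/char)
```

At 3 bytes per character, a byte-capped Chinese corpus holds roughly a third as many characters as an English one of the same size.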
Key Novelty
Systematic Synthesis of Modeling Artifacts vs. Intrinsic Difficulty
  • Analyzes six linguistic properties (orthography, morphology, lexical diversity, syntax, information density, typology) to decouple inherent learnability from modeling artifacts.
  • Identifies that 'difficulty' is often an interaction effect: what looks like morphological complexity is actually tokenizer fragmentation causing data sparsity.
  • Proposes a causal framework linking specific design choices (encoding, sampling, capacity allocation) to observed performance gaps.
Evaluation Highlights
  • Morphology-aware segmentation substantially reduces surprisal gaps between agglutinative and fusional languages compared to standard BPE.
  • Normalizing for byte-length and tokenization removes spurious correlations between morphological typology and language model performance.
  • Modular capacity allocation reduces negative transfer (interference) when typological diversity exceeds the model's effective capacity.
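The fragmentation effect behind these highlights is commonly measured as tokenizer "fertility" (average tokens per word). A minimal sketch of that metric with a greedy longest-match segmenter; the toy vocabulary and words are hypothetical (real analyses use the model's actual tokenizer):

```python
# Sketch: tokenizer fertility (tokens per word) as a fragmentation proxy.
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation against a subword vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # fall back to a single char
                tokens.append(piece)
                i = j
                break
    return tokens

def fertility(words, vocab):
    """Average number of subword tokens per word."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)

# An English-heavy vocabulary keeps English words whole but shatters
# an agglutinative Turkish word into many rare pieces.
vocab = {"walk", "ing", "walking", "house", "s", "houses", "ev", "ler"}
english = ["walking", "houses"]
turkish = ["evlerinizden"]  # roughly "from your houses": one word

print("English fertility:", fertility(english, vocab))  # 1.0
print("Turkish fertility:", fertility(turkish, vocab))  # 9.0
```

High fertility means each word is spread over many low-frequency tokens, which is exactly the data-sparsity artifact the paper argues gets mistaken for intrinsic morphological difficulty.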
Breakthrough Assessment
9/10
A comprehensive foundational survey that reframes the entire field of multilingual modeling. It shifts the burden of proof from 'linguistic difficulty' to 'engineering fairness', offering concrete design recommendations.