Assessing the Role of Data Quality in Training Bilingual Language Models

Unknown authors
Apple
arXiv
Pretraining Benchmark

📝 Paper Summary

Multilingual · Language Modeling · Data Quality Filtering
Unequal data quality between languages, rather than data quantity or the intrinsic difficulty of multilinguality, drives performance degradation in bilingual models; the degradation can be mitigated by filtering non-English data with quality classifiers trained on English data.
Core Problem
Bilingual and multilingual models often exhibit significant performance degradation in high-resource languages (like English) compared to monolingual baselines, a phenomenon often attributed to the 'curse of multilinguality' or insufficient capacity.
Why it matters:
  • Training separate models for every language is resource-prohibitive in memory-constrained settings
  • High-quality native data is scarce for many languages, limiting the performance of monolingual models in those languages
  • Prior work focused on data quantity or model size, overlooking that mixing high-quality English data with lower-quality foreign data dilutes overall model capability
Concrete Example: When a bilingual model is trained on a mix of high-quality English (FineWebEDU) and lower-quality French (mC4) data, English performance drops by ~2% compared to a monolingual English model. However, if the French data is replaced with high-quality translations of English data, this gap disappears.
Key Novelty
Cross-Lingual Quality Filtering (Projection of English Quality Standards)
  • Demonstrates that the 'curse of multilinguality' is largely a 'curse of data quality inequality'; when data quality is matched (via translation), bilingual penalty vanishes
  • Proposes using a quality classifier trained *only* on high-quality English data to filter data in other languages (French, German, Chinese) via a multilingual embedding space (SBERT)
  • Shows that filtering native data in low-resource languages using these English-derived standards improves both monolingual and bilingual performance significantly
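The cross-lingual filtering recipe above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact pipeline: it trains a plain logistic-regression quality classifier on English document embeddings with quality labels, then scores unlabeled French documents in the same shared embedding space. In practice the embeddings would come from a multilingual sentence encoder such as SBERT and the labels from a curated English corpus (e.g. FineWebEDU vs. unfiltered web text); here tiny synthetic vectors stand in so the sketch runs standalone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a shared multilingual embedding space (e.g. SBERT).
# Assumption for this toy: high- and low-quality documents cluster
# around different centers regardless of language.
DIM = 16
hq_center = rng.normal(size=DIM)
lq_center = -hq_center

def embed(n_docs, center):
    """Toy 'embeddings': noisy points around a quality-dependent center."""
    return center + 0.5 * rng.normal(size=(n_docs, DIM))

# English training set with quality labels (1 = high quality).
X_en = np.vstack([embed(200, hq_center), embed(200, lq_center)])
y_en = np.array([1] * 200 + [0] * 200)

def train_logreg(X, y, lr=0.1, steps=500):
    """Plain-numpy logistic regression; a real pipeline would use a
    standard classifier library instead."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = train_logreg(X_en, y_en)

# Score unlabeled French documents in the SAME embedding space and keep
# those above a threshold: the English-derived quality standard transfers
# because the encoder maps both languages into one space.
X_fr = np.vstack([embed(50, hq_center), embed(50, lq_center)])
scores = 1.0 / (1.0 + np.exp(-(X_fr @ w + b)))
keep = scores > 0.5
print(f"kept {keep.sum()} of {len(keep)} French docs")
```

The key design point is that the classifier never sees labeled French data; quality standards are projected across languages entirely through the multilingual embedding space.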
Evaluation Highlights
  • Filtering French data with the proposed method improves monolingual performance by 2–4% and reduces bilingual model performance gaps to within 1%
  • Bilingual models trained on filtered data outperform public bilingual models like CroissantLLM by 1.7% on zero-shot tasks
  • Data quality filtering allows a smaller dataset (FineWeb2 10%) to match the performance of highly curated translated data (TransWebEDU) on translated benchmarks
Breakthrough Assessment
7/10
Provides a crucial insight that reframes the 'curse of multilinguality' as a data quality issue. The practical recipe for cross-lingual filtering is highly valuable for low-resource languages, though the reliance on English standards is a known limitation.