FineWebEDU: A high-quality English dataset filtered for educational value and information density
mC4: A massive multilingual dataset from Common Crawl, generally considered lower quality than curated datasets like FineWeb
SBERT: Sentence-BERT—a modification of the BERT network that uses siamese network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity
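A minimal sketch of the comparison step, with toy 3-dimensional vectors standing in for real SBERT embeddings (actual SBERT vectors come from the sentence-transformers library and have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — illustrative values only, not real model output
emb_cat = [0.9, 0.1, 0.0]
emb_kitten = [0.8, 0.2, 0.1]
emb_car = [0.0, 0.1, 0.9]

print(cosine_similarity(emb_cat, emb_kitten))  # close to 1: similar meanings
print(cosine_similarity(emb_cat, emb_car))     # near 0: unrelated
```

Because the embeddings are compared with a cheap vector operation rather than a full cross-encoder pass, SBERT makes large-scale semantic search and clustering tractable.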
Chinchilla scaling: Compute-optimal training rules for allocating a fixed compute budget between model size and training data, suggesting roughly 20 training tokens per model parameter
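The rule of thumb reduces to one multiplication; the exact ratio depends on the assumptions of the scaling-law fit, but 20 tokens per parameter is the commonly cited figure:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # Chinchilla heuristic: training tokens ≈ 20 × model parameters
    return n_params * tokens_per_param

# A 7B-parameter model would want roughly 140B training tokens
print(chinchilla_optimal_tokens(7e9))  # 1.4e11
```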
TransWebEDU: A version of the FineWebEDU dataset machine-translated into other languages (e.g., French) to create a high-quality parallel corpus
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects spanning STEM, the humanities, the social sciences, and other areas
PolyLM tokenizer: A tokenizer designed for multilingual models to provide more balanced vocabulary coverage across languages
curse of multilinguality: The phenomenon where adding more languages to a model of fixed capacity degrades performance on individual languages
DCLM classifier: A fastText classifier trained to distinguish high-quality data (like OpenHermes) from lower-quality web data
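A hedged sketch of the input format such a classifier consumes: fastText's supervised mode expects one example per line, prefixed with `__label__<class>`. The labels and example texts below are illustrative, not the actual DCLM training set:

```python
# Prepare training lines in fastText's supervised format,
# "__label__<class> <text>". Labels "hq"/"lq" are hypothetical names.
def to_fasttext_line(label, text):
    # fastText expects single-line examples; collapse internal whitespace
    return f"__label__{label} {' '.join(text.split())}"

examples = [
    ("hq", "A clear explanation of how photosynthesis converts light into chemical energy."),
    ("lq", "click here best deals buy now limited offer!!!"),
]

lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
# With the fasttext package installed, a classifier could then be
# trained on such a file via fasttext.train_supervised(input=...)
# and used to score web documents for quality filtering.
```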