
Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets

Sultan Alrashed, Francesco Orabona
arXiv (2025)
Pretraining Benchmark

📝 Paper Summary

Multilingual Pretraining Data Dataset Curation Deduplication
MixMinMatch treats redundancy across independent web crawls as a quality signal, filtering pretraining data by retaining only documents that appear in at least two independent source corpora.
Core Problem
Independent research groups waste resources re-scraping the same web content, and identifying high-quality data typically requires expensive model-based filtering or brittle language-specific heuristics.
Why it matters:
  • Repeated crawling of the same content consumes substantial computational and storage resources without adding diversity
  • Standard quality filters (like C4's English-centric heuristics) often discard high-quality text in morphologically rich or non-Latin script languages
  • Model-based quality filtering (e.g., using classifiers) is computationally expensive, requiring inference passes over billions of documents
Concrete Example: Standard pipelines drop unique high-quality content found in ArabicWeb24, while looser pipelines retain noise. By checking whether a document, such as a news article, appears in both C4 and HPLT, MixMinMatch can confirm its value without needing an Arabic-specific quality classifier.
Key Novelty
MixMinMatch (Ensemble Filtering via Deduplication)
  • Aggregates multiple datasets while preserving source tags, treating each dataset's inclusion of a document as an independent 'vote' for its quality
  • Leverages standard MinHash deduplication clusters to count unique sources; documents appearing in two or more sources are retained as high-confidence data
  • Extracts a strong quality signal essentially 'for free' as a byproduct of the mandatory deduplication step, removing the need for expensive inference-based filtering
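The mechanism above can be sketched in a few dozen lines: shingle each document, compute a MinHash signature, cluster near-duplicates, and keep only documents whose cluster spans at least two source corpora. This is a minimal toy illustration, not the paper's pipeline; the shingle size, number of permutations, similarity threshold, and the brute-force pairwise clustering (a real pipeline would use LSH banding) are all illustrative assumptions.

```python
# Toy sketch of cross-source agreement filtering via MinHash.
# Assumptions (not from the paper): 5-char shingles, 64 hash permutations,
# 0.8 similarity threshold, brute-force pairwise clustering.
import hashlib
from itertools import combinations

def shingles(text, k=5):
    """Character k-gram shingle set of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_perm=64):
    """MinHash signature: for each 'permutation' (seeded hash), keep the minimum."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def jaccard_est(sig_a, sig_b):
    """Estimated Jaccard similarity: fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def cross_source_filter(docs, threshold=0.8):
    """Keep documents whose near-duplicate cluster spans >= 2 source corpora.

    docs: list of (source_tag, text) pairs, e.g. ("C4", "...").
    """
    sigs = [minhash(shingles(text)) for _, text in docs]

    # Union-find to group near-duplicates into clusters.
    parent = list(range(len(docs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if jaccard_est(sigs[i], sigs[j]) >= threshold:
            parent[find(i)] = find(j)

    # Count distinct source tags per cluster; this is the "vote" for quality.
    cluster_sources = {}
    for idx, (src, _) in enumerate(docs):
        cluster_sources.setdefault(find(idx), set()).add(src)

    return [docs[i] for i in range(len(docs))
            if len(cluster_sources[find(i)]) >= 2]
```

A document crawled by two independent pipelines survives, while single-source content is dropped, so the quality signal really does come for free from the deduplication clusters that the pipeline must compute anyway.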
Evaluation Highlights
  • AraMix-Matched (the proposed cross-source subset) outperforms the strongest single-source baseline, ArabicWeb24 (0.161 vs. 0.154 aggregate score)
  • Cross-source agreement identifies tens of billions of high-quality tokens; e.g., C4 and CulturaX share over 9 billion tokens of near-duplicate Arabic content
  • ArabicWeb24 achieves a 93.8% survival rate after cross-dataset deduplication, indicating its pipeline successfully captures unique, high-quality content missed by others
Breakthrough Assessment
7/10
Offers a clever, compute-efficient heuristic for data filtering that turns the 'waste' of redundant crawling into a feature. While conceptually simple, the demonstrated efficiency gains, and the fact that it matches or outperforms more complex filtering baselines, are significant for multilingual LLM training.