Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

📝 Paper Summary

Data Curation, Multilingual NLP

JQL is a systematic pipeline for filtering multilingual pre-training data by distilling the quality judgments of large LLMs into lightweight, embedding-based regressors that generalize across languages.

Core Problem

High-quality multilingual data is scarce, and existing filtering methods rely on heuristics or English-centric classifiers that fail to scale or generalize to low-resource languages.

Why it matters:

Pre-training data quality is a primary factor in LLM performance and training efficiency
Current state-of-the-art curation strategies are often closed-source, hindering reproducibility
Heuristic filters (like length or keyword checks) often discard high-quality content in low-resource languages or retain low-quality noise

Concrete Example: A standard heuristic filter might reject a short but highly educational math tutorial in Basque because it fails a length threshold or lacks specific keywords, whereas JQL's embedding-based model correctly identifies its educational value by projecting it into a shared semantic space.

Key Novelty

Judging Quality across Languages (JQL)

Distills the quality-judging capabilities of large 'teacher' LLMs (like Llama-3.3-70B) into lightweight 'student' regressors built on top of pre-trained multilingual text embeddings
Uses a fixed cross-lingual embedding space (Snowflake Arctic) to enable zero-shot transfer, allowing the lightweight model to judge quality even in languages it wasn't explicitly trained on

Architecture

The four-stage JQL pipeline: (1) Human Annotation, (2) LLM Evaluation, (3) Distillation into Lightweight Annotators, (4) Filtering.

Evaluation Highlights

Retains >9% more tokens than the Fineweb2 heuristic baseline for Spanish while achieving higher downstream model performance
JQL-filtered data consistently outperforms Fineweb2 heuristic baselines across 13 diverse European languages on MMLU, HellaSwag, and ARC benchmarks
Lightweight annotators achieve Spearman correlation >0.87 with teacher LLMs, successfully preserving the relative ranking of document quality at a fraction of the compute cost

Breakthrough Assessment

8/10

Provides a reproducible, open-source recipe for high-quality multilingual data curation, significantly outperforming heuristics and addressing the critical data scarcity bottleneck for non-English languages.

⚙️ Technical Details

Problem Definition

Setting: Multilingual document quality estimation and filtering for LLM pre-training corpora

Inputs: Raw multilingual web document text D

Outputs: Quality score S (continuous value) predicting educational value

Pipeline Flow

Human Annotation (Ground Truth Collection)
Teacher LLM Selection & Scoring
Lightweight Annotator Training (Distillation)
Filtering & Thresholding
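The last stage can be pictured as keeping documents whose predicted quality score clears a cutoff. The following is a minimal sketch, assuming the cutoff is chosen as a quantile of the score distribution; the function names, the toy scores, and the 0.5 quantile are illustrative and not taken from the paper.

```python
import math

def quantile_threshold(scores, q):
    """Return the score at quantile q (0 <= q <= 1) using the nearest-rank method."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    # Nearest-rank: smallest value with at least a fraction q of the data at or below it.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def filter_documents(docs, scores, q):
    """Keep documents whose score is at or above the q-quantile cutoff."""
    cutoff = quantile_threshold(scores, q)
    return [doc for doc, s in zip(docs, scores) if s >= cutoff]

# Toy document ids and made-up quality scores.
docs = ["a", "b", "c", "d", "e"]
scores = [0.5, 3.2, 1.1, 4.0, 2.4]
kept = filter_documents(docs, scores, q=0.5)
```

Raising q makes the filter stricter and lowering it retains more tokens, which is the trade-off the evaluation reports for Spanish.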

System Modules

Human Annotator Pool

Create ground truth for evaluating teacher models

Model or implementation: Human experts

Teacher LLM

Generate synthetic training labels for the lightweight model

Model or implementation: Gemma-3-27B-it, Mistral-3.1-24B-it, or Llama-3.3-70B-it

Embedding Backbone (Distillation)

Convert text into multilingual semantic vectors

Model or implementation: Snowflake Arctic Embed v2.0 (frozen)

Regression Head (Distillation)

Predict quality score from embeddings

Model or implementation: Simple MLP with ReLU activation

Novel Architectural Elements

Decoupled filtering architecture: a frozen cross-lingual embedding backbone combined with lightweight regression heads makes it cheap to switch filtering criteria (e.g., educational value vs. toxicity), because documents are embedded once and never re-encoded
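The decoupling can be illustrated with a small sketch: embeddings are computed once and cached, and each filtering criterion is a separate small MLP head applied to the cached vectors. Everything below is hypothetical (the three-dimensional vectors, the weights, and the names embedding_cache and heads); real embeddings would come from the frozen backbone and real weights from training on teacher labels.

```python
def relu(x):
    return [max(0.0, v) for v in x]

def linear(x, weights, bias):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def mlp_head(embedding, params):
    """One hidden layer with ReLU, followed by a scalar output."""
    hidden = relu(linear(embedding, params["w1"], params["b1"]))
    return linear(hidden, params["w2"], params["b2"])[0]

# Hypothetical cache: each document is embedded once by the frozen backbone.
embedding_cache = {
    "doc-1": [1.0, -2.0, 0.5],
    "doc-2": [0.0, 1.0, 1.0],
}

# Two toy heads with made-up weights, one per filtering criterion.
heads = {
    "educational_value": {
        "w1": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        "b1": [0.0, 0.0],
        "w2": [[2.0, 1.0]],
        "b2": [0.5],
    },
    "toxicity": {
        "w1": [[0.0, 0.0, 1.0], [0.0, -1.0, 0.0]],
        "b1": [0.0, 0.0],
        "w2": [[1.0, 1.0]],
        "b2": [0.0],
    },
}

# Scoring under a new criterion only reruns the cheap head, not the backbone.
scores = {
    name: {doc_id: mlp_head(emb, params) for doc_id, emb in embedding_cache.items()}
    for name, params in heads.items()
}
```

Adding a third criterion would mean training and applying one more head over the same cache.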

Modeling

Base Model: Lightweight regressor on Snowflake Arctic Embed v2.0

Training Method: Supervised regression (Knowledge Distillation)

Objective Functions:

Purpose: Minimize error between student prediction and teacher score.

Formally: Standard regression loss (likely MSE, though not explicitly formulated in text) on synthetic teacher labels.
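If the loss is indeed MSE, which is an assumption, the objective could be written as below. The notation is introduced here for illustration: e(d_i) is the frozen embedding of document d_i, f_\theta is the regression head with trainable parameters \theta, y_i is the teacher score, and N is the number of training documents.

```latex
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \bigl( f_\theta(e(d_i)) - y_i \bigr)^2
```

Only \theta receives gradients; the embedding function e stays fixed.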

Adaptation: Training only the MLP head while keeping embedding backbone frozen

Training Data:

500k documents sampled per language from Fineweb2
Synthetic labels generated by Teacher LLMs (Gemma, Mistral, Llama)
Two variants: natural label distribution vs. balanced distribution (oversampling rare high scores)
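The balanced variant can be sketched as resampling rare score buckets until they match the most frequent one. This is one possible reading of "oversampling rare high scores", assuming discrete score labels; the helper name balance_by_oversampling and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def balance_by_oversampling(examples, seed=0):
    """Resample with replacement so every score bucket matches the largest bucket.

    examples: list of (doc_id, integer_score) pairs.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for doc_id, score in examples:
        buckets[score].append((doc_id, score))
    target = max(len(items) for items in buckets.values())
    balanced = []
    for score in sorted(buckets):
        items = buckets[score]
        balanced.extend(items)
        # Draw extra copies at random until the bucket reaches the target size.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced

# Toy natural distribution: low scores dominate, the top score is rare.
examples = [("a", 0), ("b", 0), ("c", 0), ("d", 0), ("e", 1), ("f", 1), ("g", 5)]
balanced = balance_by_oversampling(examples)
counts = Counter(score for _, score in balanced)
```

Every original example is kept, so the balanced set is a superset of the natural one with duplicated rare items.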

Key Hyperparameters:

training_sample_size: 500,000 documents per language (35 languages)

Compute: Inference throughput: ~11,000 annotations per minute on a single A100 GPU (average 690 tokens/doc)
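The reported throughput translates into token and time budgets with simple arithmetic. The sketch below uses only the two reported figures; the one-billion-document corpus in the usage line is a hypothetical size chosen for illustration.

```python
ANNOTATIONS_PER_MINUTE = 11_000   # reported throughput on a single A100
AVG_TOKENS_PER_DOC = 690          # reported average document length

tokens_per_minute = ANNOTATIONS_PER_MINUTE * AVG_TOKENS_PER_DOC
tokens_per_second = tokens_per_minute / 60
docs_per_gpu_hour = ANNOTATIONS_PER_MINUTE * 60

def gpu_hours_for(num_docs):
    """Single-GPU hours needed to annotate num_docs at the reported rate."""
    return num_docs / docs_per_gpu_hour

# Hypothetical corpus of one billion documents.
hours_for_one_billion = gpu_hours_for(1_000_000_000)
```

At this rate a single GPU processes about 7.59 million tokens per minute, and the hypothetical one-billion-document corpus would take roughly 1,515 GPU-hours.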

Comparison to Prior Work

vs. Fineweb-Edu: Extends the approach to 35 languages using cross-lingual embeddings instead of just English
vs. Fineweb2 (Heuristic): Replaces static rules with learned semantic quality signals, retaining more high-quality data
vs. QuRating: Focuses specifically on diverse multilingual transfer via embedding backbones rather than just English/monolingual scoring [not cited in paper]

Limitations

Relies on 'Teacher' LLMs which may hallucinate or have biases in low-resource languages
Performance drops for linguistically isolated languages (Irish, Maltese, Basque) not well-represented in the embedding model
Downstream evaluation limited to 2B parameter models and 27B tokens due to compute constraints

Reproducibility

Code: https://huggingface.co/spaces/JQL-AI/JQL

Artifacts released: Ground truth dataset (511 docs x 35 languages), 14M synthetic annotations, code, and JQL lightweight models on HuggingFace.

Missing: Exact hyperparameters for the ablation model pre-training (learning rates, schedules) are referenced in Appendix D.1 but not fully detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Pre-train 2B parameter decoder-only models on filtered datasets and evaluate on downstream benchmarks

Benchmarks:

MMLU (multilingual): general knowledge and reasoning
HellaSwag (multilingual): commonsense reasoning
ARC (multilingual): science reasoning

Metrics:

Token-normalized probability of correct answer
Spearman correlation (for annotator agreement)
Statistical methodology: Not explicitly reported in the paper
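Spearman correlation is the Pearson correlation computed on ranks, which is why it captures whether a student preserves the teacher's ordering of documents rather than its exact scores. The sketch below is a plain-Python implementation with average ranks for ties; the teacher and student values are made up for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up scores: the student swaps the order of the third and fourth documents.
teacher = [0, 1, 2, 3, 4, 5]
student = [0.2, 0.9, 2.5, 2.4, 4.1, 4.8]
rho = spearman(teacher, student)
```

A single swapped pair among six documents already lowers the coefficient to about 0.94, while any strictly increasing rescaling of the student scores would leave it unchanged.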

Key Results

Results comparing JQL filtering against the Fineweb2 (FW2) heuristic baseline, aggregated across languages.
Benchmark	Metric	Baseline	This Paper	Δ
MMLU	Average Score	0.2694	0.2727	+0.0033
HellaSwag	Average Score	0.2974	0.3021	+0.0047
ARC	Average Score	0.2605	0.2655	+0.0050
Spanish Corpus	Retained Tokens %	100	109	+9
Ground Truth Correlation	Spearman Correlation	0.55	0.68	+0.13

Experiment Figures

Heatmap of cross-lingual performance (Spearman correlation) for lightweight annotators trained on specific languages vs. evaluated on others.

Learning curves (MMLU/HellaSwag/ARC) for Spanish models trained on FW2 Heuristic vs. JQL filtered data.

Main Takeaways

JQL filtering consistently improves downstream model performance over heuristic baselines across 13 languages.
Distilled lightweight annotators are highly efficient and maintain high rank correlation (>0.87) with much larger teacher models.
The method allows for 'softer' filtering: it can retain significantly more data (e.g., +9% in Spanish) while still improving quality, addressing the data scarcity issue in multilingual training.
Cross-lingual transfer is robust: models trained on one language family generalize well to others, largely thanks to the aligned embedding space.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM pre-training pipelines
Knowledge of knowledge distillation
Familiarity with text embeddings

Key Terms

JQL: Judging Quality across Languages—the proposed pipeline for multilingual data filtering

Fineweb-Edu: A dataset and filtering approach using LLM-generated educational quality scores, originally for English

Fineweb2: A large-scale multilingual web dataset (FW2) used as the raw source for experiments

Distillation: Training a smaller 'student' model to mimic the outputs of a larger 'teacher' model

Cross-lingual transfer: The ability of a model trained on one language to perform tasks in another language without explicit retraining

Spearman correlation: A statistical measure of rank correlation, assessing how well the relationship between two variables can be described using a monotonic function

MMLU: Massive Multitask Language Understanding—a benchmark measuring knowledge across 57 subjects

HellaSwag: A benchmark for commonsense reasoning

ARC: AI2 Reasoning Challenge—a benchmark for grade-school science questions

WARC: Web ARChive file format, commonly used to store web crawls