JQL: Judging Quality across Languages—the proposed pipeline for multilingual data filtering
FineWeb-Edu: A dataset and filtering approach using LLM-generated educational quality scores, originally developed for English
FineWeb2: A large-scale multilingual web dataset (FW2) used as the raw source for experiments
Distillation: Training a smaller 'student' model to mimic the outputs of a larger 'teacher' model
Cross-lingual transfer: The ability of a model trained on one language to perform tasks in another language without explicit retraining
Spearman correlation: A statistical measure of rank correlation, assessing how well the relationship between two variables can be described using a monotonic function
MMLU: Massive Multitask Language Understanding—a benchmark measuring knowledge across 57 subjects
HellaSwag: A benchmark for commonsense reasoning, posed as multiple-choice sentence completion
ARC: AI2 Reasoning Challenge—a benchmark for grade-school science questions
WARC: Web ARChive file format, commonly used to store web crawls
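
To make the Spearman correlation entry concrete, here is a minimal sketch in plain Python: it ranks both variables (tied values share the mean of their positions) and then computes the Pearson correlation of the ranks. The function names `average_ranks` and `spearman` and the sample score lists are hypothetical, chosen for illustration only. They are not taken from the JQL pipeline.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Hypothetical scores: the second list grows nonlinearly with the first,
# but the ordering of the items is identical.
teacher_scores = [0, 1, 2, 3, 4, 5]
student_scores = [0.1, 0.9, 4.0, 9.2, 15.8, 25.3]
print(spearman(teacher_scores, student_scores))
```

Because only the ordering matters, a monotonic but nonlinear relationship like the one above yields a coefficient of 1 (up to floating-point rounding), and reversing one of the lists yields -1.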