QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining

📝 Paper Summary

Data Selection for Pretraining
LLM Pretraining Efficiency

QuaDMix is a data selection framework that jointly optimizes quality and diversity through a parameterized sampling function, whose parameters are tuned using small proxy models and a regressor that predicts downstream performance.

Core Problem

Existing data selection methods optimize quality and diversity separately (e.g., filtering then mixing), overlooking their inherent trade-off and interplay, which leads to suboptimal pretraining efficiency.

Why it matters:

High-quality data is limited, and aggressive filtering can reduce diversity, hurting model generalization.
Different quality criteria have biases that skew domain distributions, meaning optimal mixtures depend on the quality filters used.
Current approaches rely on heuristics or manual tuning for mixing ratios, which is inefficient and scales poorly.

Concrete Example: Choosing a strict quality filter (like educational value) might inadvertently filter out most 'Sports' or 'Entertainment' data, skewing the distribution. Independently optimizing for 'quality' (filtering) and then 'diversity' (mixing) fails to account for how the filter itself altered the domain composition.
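The skew described in this example can be illustrated with a toy corpus. This is a minimal sketch with invented domains and scores (none of the numbers come from the paper); it only shows how a single hard quality threshold changes the domain shares before any mixing step runs.

```python
from collections import Counter

# Toy corpus of (domain, educational_value_score) pairs.
# All values are invented for illustration.
docs = [
    ("Science", 0.9), ("Science", 0.8), ("Science", 0.7), ("Science", 0.4),
    ("Sports", 0.6), ("Sports", 0.3), ("Sports", 0.2), ("Sports", 0.1),
    ("Entertainment", 0.5), ("Entertainment", 0.3),
    ("Entertainment", 0.2), ("Entertainment", 0.1),
]

def domain_shares(items):
    """Return the fraction of documents that belong to each domain."""
    counts = Counter(domain for domain, _ in items)
    total = sum(counts.values())
    return {domain: counts[domain] / total for domain in counts}

before = domain_shares(docs)

# Hard quality filter: keep only documents scoring at least 0.6.
kept = [(domain, score) for domain, score in docs if score >= 0.6]
after = domain_shares(kept)

print("before:", before)
print("after: ", after)
```

In this toy setup each domain starts at one third of the corpus, while after filtering Science dominates and Entertainment disappears entirely. A mixing step applied afterwards can only reweight what the filter left behind.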

Key Novelty

Unified Parameterized Sampling for Quality-Diversity Balance

Defines a sampling function that assigns probabilities based on a weighted combination of multiple quality scores and domain labels, rather than hard filtering.
Uses a two-step optimization: first training many small proxy models (1M parameters) to gather performance data, then training a regressor to predict the performance of unseen sampling parameters.
Optimizes parameters specifically for target downstream tasks by using those tasks' training data as the validation set for the proxy models.

Architecture

The QuaDMix pipeline: Feature Extraction -> Parameterized Sampling -> Proxy Model Training -> Regression -> Optimal Parameter Search -> Large Scale Training.

Evaluation Highlights

Achieves an average performance improvement of 7.2% across multiple benchmarks (including MMLU, HellaSwag, ARC) compared to random selection.
Outperforms independent strategies like RegMix (diversity only) and AskLLM/Fineweb-Edu (quality only) on an aggregated benchmark of 9 tasks.
Demonstrates that task-specific optimization (QuaDMix-BMK) further boosts performance by using downstream task data as the validation target.

Breakthrough Assessment

7/10

Solid contribution addressing the specific interaction between quality and diversity. The use of very small proxy models (1M params) to tune 530M models is efficient, though the scale of the final evaluation (530M) is relatively small compared to modern standards.

⚙️ Technical Details

Problem Definition

Setting: LLM Pretraining Data Selection

Inputs: Raw pretraining corpus X, set of quality scorers, domain classifier

Outputs: Sampled dataset D optimized for downstream performance

Pipeline Flow

Feature Extraction (Quality & Domain Labeling)
Parameter Sampling & Proxy Dataset Generation
Proxy Model Training & Evaluation
Regression Model Fitting
Optimal Parameter Search & Large-Scale Sampling

System Modules

Feature Extractor

Label every document with domain and quality scores

Model or implementation: Various (DeBERTa V3 for domains; AskLLM, Fineweb-Edu, and DCLM for quality)

Sampler

Determine sampling frequency for each document based on parameters

Model or implementation: Parameterized Sigmoid Function
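A rough sketch of what a parameterized sigmoid sampler of this kind could look like is given here. It is not the paper's exact formula: the parameter names (`quality_weights`, `threshold`, `steepness`, `domain_weight`) and all numeric values are assumptions made for illustration, and the real method may operate on quality ranks rather than raw scores.

```python
import math
import random

def merged_quality(scores, weights):
    """Weighted combination of several quality scores for one document."""
    return sum(weights[name] * scores[name] for name in weights)

def sampling_probability(scores, domain, theta):
    """Sigmoid of (merged quality - domain threshold), scaled by a domain weight.

    `theta` maps each domain to its own hypothetical parameters, so quality
    and domain mixture are controlled by one parameter set.
    """
    p = theta[domain]
    q = merged_quality(scores, p["quality_weights"])
    gate = 1.0 / (1.0 + math.exp(-p["steepness"] * (q - p["threshold"])))
    return p["domain_weight"] * gate

# Hypothetical parameters for two domains (values invented).
theta = {
    "Science": {"quality_weights": {"edu": 0.7, "askllm": 0.3},
                "threshold": 0.5, "steepness": 10.0, "domain_weight": 1.0},
    "Sports": {"quality_weights": {"edu": 0.2, "askllm": 0.8},
               "threshold": 0.3, "steepness": 10.0, "domain_weight": 0.8},
}

p_hi = sampling_probability({"edu": 0.9, "askllm": 0.8}, "Science", theta)
p_lo = sampling_probability({"edu": 0.1, "askllm": 0.2}, "Science", theta)

# Soft selection: a document is kept with probability p instead of
# being passed or rejected by a hard cutoff.
rng = random.Random(0)
keep_hi = rng.random() < p_hi
```

Because the gate is smooth, documents just below a threshold still have a nonzero chance of being sampled, which is one way to retain some diversity that a hard filter would discard.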

Regressor

Predict validation loss given sampling parameters

Model or implementation: LightGBM
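The fit-then-search idea behind the regressor can be sketched in a few lines. To keep the example dependency-free, a k-nearest-neighbour average is swapped in for LightGBM, and `proxy_loss` is a synthetic function standing in for "train a proxy model and measure its validation loss". Both are assumptions for illustration, as is the two-dimensional parameter space.

```python
import random

def proxy_loss(theta):
    """Synthetic stand-in for training a proxy model and reading its loss."""
    a, b = theta
    return 2.0 + (a - 0.6) ** 2 + (b - 0.3) ** 2

def knn_predict(train, theta, k=5):
    """Predict loss as the mean loss of the k nearest observed parameter sets.

    Stand-in regressor; the summarized method uses LightGBM instead.
    """
    nearest = sorted(
        train,
        key=lambda row: sum((x - y) ** 2 for x, y in zip(row[0], theta)),
    )[:k]
    return sum(loss for _, loss in nearest) / len(nearest)

rng = random.Random(0)

# Step 1: sample parameter sets and "train" one proxy model per set.
observed = []
for _ in range(200):
    theta = (rng.random(), rng.random())
    observed.append((theta, proxy_loss(theta)))

# Step 2: score many unseen candidates with the cheap regressor only,
# then keep the candidate with the lowest predicted loss.
candidates = [(rng.random(), rng.random()) for _ in range(2000)]
best = min(candidates, key=lambda t: knn_predict(observed, t))

print("best candidate:", best, "true loss:", proxy_loss(best))
```

The design point is that the expensive step (proxy training) runs a fixed number of times, while the search over candidates only calls the regressor, so the candidate pool can be far larger than the number of trained proxies.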

Novel Architectural Elements

Unified sampling function S(x, q, d; theta) that integrates multiple quality scores with domain-specific weights and thresholds into a single probability
Optimization loop using 1M-parameter proxy models to tune data selection for 530M-parameter models

Modeling

Base Model: Transformer (Decoder-only)

Training Method: Pretraining from scratch

Training Data:

Source: RefinedWeb (570B tokens)
Proxy Training: 1B tokens per run (3000 runs)
Final Training: 500B tokens

Key Hyperparameters:

proxy_model_params: 1M
final_model_params: 530M
proxy_training_tokens: 1B
final_training_tokens: 500B
proxy_compute: 1 NVIDIA H100 GPU hour per run
final_compute: 32 NVIDIA GPUs for 3 days

Compute: 3000 H100-hours for proxy search + 32 GPUs * 3 days for final model
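The compute figures above can be put on a common scale with some quick arithmetic. The GPU type for the final run is not stated in this summary, so the GPU-hour comparison below assumes comparable hardware and is only indicative.

```python
# Proxy search: 3000 runs at roughly 1 H100 GPU-hour each.
proxy_runs = 3000
gpu_hours_per_proxy_run = 1
proxy_gpu_hours = proxy_runs * gpu_hours_per_proxy_run

# Final model: 32 GPUs for 3 days (GPU type assumed comparable).
final_gpus = 32
final_days = 3
final_gpu_hours = final_gpus * final_days * 24

# Token budgets: 1B tokens per proxy run versus 500B for the final run.
proxy_tokens_total = proxy_runs * 1_000_000_000
final_tokens = 500_000_000_000

print("proxy GPU-hours:", proxy_gpu_hours)          # 3000
print("final GPU-hours:", final_gpu_hours)          # 2304
print("search / final GPU-hours:", round(proxy_gpu_hours / final_gpu_hours, 2))
print("proxy tokens / final tokens:", proxy_tokens_total // final_tokens)
```

Under these assumptions the proxy search costs about 1.3 times the GPU-hours of the final 530M run and processes six times as many tokens in total, which gives context for the later remark that the search is computationally intensive despite the small proxy size.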

Comparison to Prior Work

vs. RegMix: QuaDMix optimizes quality thresholds and merging weights jointly with domain mixture, whereas RegMix only optimizes domain weights.
vs. AskLLM/Fineweb-Edu: QuaDMix balances multiple quality criteria and diversity, rather than relying on a single fixed quality threshold.
vs. DSIR: QuaDMix uses a learned parameterized function guided by proxy model performance, whereas DSIR relies on matching a target distribution via importance weights.

Limitations

Proxy models are very small (1M parameters) compared to target models (530M), which may limit the fidelity of performance ranking.
Requires training thousands of proxy models, which is computationally intensive despite the small model size.
Ranking documents by quality score requires sorting across the entire dataset, which is computationally expensive, so ranks have to be estimated via subsampling.
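The subsampling workaround in the last limitation can be sketched as follows: sort a small random sample once, then estimate any document's percentile by binary search in that sample. The sample size, seeds, and uniform toy scores are assumptions for illustration, not values from the paper.

```python
import bisect
import random

def build_reference(scores, sample_size, seed=0):
    """Sort a random subsample once; this replaces sorting the full corpus."""
    rng = random.Random(seed)
    return sorted(rng.sample(scores, sample_size))

def estimated_percentile(score, reference):
    """Estimate the fraction of the corpus scoring at or below `score`."""
    return bisect.bisect_right(reference, score) / len(reference)

# Toy corpus of 100,000 quality scores drawn uniformly from [0, 1).
rng = random.Random(1)
all_scores = [rng.random() for _ in range(100_000)]

reference = build_reference(all_scores, sample_size=2_000)
est = estimated_percentile(0.25, reference)

print("estimated percentile of score 0.25:", est)
```

Each lookup costs O(log n) in the sample size instead of requiring a global sort, at the price of a sampling error that shrinks as the reference sample grows.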

Reproducibility

No code release is reported. The method relies on specific proprietary or external quality filters (AskLLM, Fineweb-Edu) and datasets (RefinedWeb). The parameterized sampling function and the regression features are described mathematically rather than through a released implementation.

📊 Experiments & Results

Evaluation Setup

Pretraining 530M parameter language models from scratch on 500B tokens

Benchmarks:

Aggregated Benchmark (9 tasks spanning Commonsense, Reading Comprehension, Math, and Knowledge)
HellaSwag (Commonsense Reasoning)
MMLU (Knowledge Intensive)
ARC-E/C (Reasoning)

Metrics:

Normalized Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

QuaDMix outperforms baselines on the aggregated benchmark (average of 9 tasks), and improves further when optimized specifically for those tasks.

Benchmark	Metric	Baseline	This Paper	Δ
Aggregated Benchmark (Avg)	Normalized Accuracy	44.6	47.8	+3.2
Aggregated Benchmark (Avg)	Normalized Accuracy	46.1	47.8	+1.7
Aggregated Benchmark (Avg)	Normalized Accuracy	47.2	47.8	+0.6
Aggregated Benchmark (Avg)	Normalized Accuracy	47.8	48.2	+0.4
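As a quick consistency check on the rows above, the absolute deltas and the corresponding relative gains can be recomputed directly. The table does not name the baseline behind each row, so the snippet works only with the numbers as listed.

```python
# (baseline, this_paper) pairs copied from the Key Results rows.
rows = [
    (44.6, 47.8),
    (46.1, 47.8),
    (47.2, 47.8),
    (47.8, 48.2),
]

# Absolute gain in accuracy points and relative gain in percent.
deltas = [round(ours - base, 1) for base, ours in rows]
relative_pct = [round(100 * (ours - base) / base, 1) for base, ours in rows]

for (base, ours), d, r in zip(rows, deltas, relative_pct):
    print(f"{base} -> {ours}: +{d} points, +{r}% relative")
```

The first row's relative gain works out to roughly 7.2%, which lines up with the headline improvement quoted under Evaluation Highlights. That suggests, though the table itself does not confirm it, that the headline figure is a relative gain measured against the weakest baseline.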

Experiment Figures

Validation of the regression model. Left: Correlation between predicted loss and real proxy model loss. Right: Comparison of different regression models (LightGBM vs SVR) across training set sizes.

Analysis of optimal parameters found by QuaDMix-BMK. Left: Domain weight changes relative to natural distribution. Right: Weights assigned to different quality filters (DCLM, Fineweb, AskLLM).

Main Takeaways

Jointly optimizing quality and diversity (QuaDMix) consistently outperforms optimizing them in isolation (RegMix for diversity, various filters for quality).
Different quality criteria have trade-offs; merging them with learned weights allows the model to leverage complementary information.
The optimal data mixture shifts significantly depending on the quality criteria used, validating the need for joint optimization.
The regression target (validation set for proxy models) effectively steers the data selection towards specific downstream tasks (e.g., QuaDMix-BMK boosts performance on targeted benchmarks).

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM pretraining pipelines
Familiarity with data quality filtering (perplexity, classifiers)
Knowledge of scaling laws and proxy model methodologies

Key Terms

QuaDMix: The proposed framework that parameterizes data sampling probabilities based on quality and domain to jointly optimize them.

Proxy Model: A small model (e.g., 1M parameters) trained to estimate the performance of larger models, allowing for cheap exploration of hyperparameters.

LightGBM: A gradient boosting framework that uses tree-based learning algorithms, used here as a regressor to predict model loss from sampling parameters.

RefinedWeb: A large-scale English web dataset used as the source corpus for pretraining experiments.

RegMix: A baseline method that optimizes data mixtures (diversity) using regression on proxy model results but does not jointly optimize quality filtering.

AskLLM: A quality filtering method that uses a prompted LLM to score data quality.

SwiGLU: A widely used activation function in LLMs, a variant of GLU (Gated Linear Unit).

RoPE: Rotary Positional Embeddings, a method for encoding position information in transformer models.