Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration

📝 Paper Summary

Data Selection, Data Curation

The paper proposes a multi-actor framework where independent data selection 'actors' (Quality, Domain, Topic) collaborate via a central console that dynamically adjusts their influence based on model feedback, optimizing data efficiency and performance.

Core Problem

Existing data selection methods (quality filtering, domain mixing, influence functions) often conflict with each other (e.g., high-quality data might have low topic diversity or low model influence), and integrating them naively is inefficient or leads to suboptimal results.

Why it matters:

Data quality significantly impacts LM performance and training efficiency.
Online selection methods such as MATES or DSDM are computationally expensive because they require frequently relabeling the entire dataset.
Static heuristics don't adapt to the model's evolving state during training.

Key Novelty

Multi-Actor Collaborative Data Selection

Decomposes data selection into separate 'Actors' (e.g., Quality Actor, Domain Actor, Topic Actor), each maintaining its own memory and scoring logic.
Uses an 'Actor Console' to aggregate the actors' scores and to dynamically adjust their collaboration weights, based on reward signals (influence functions on reference tasks) computed from the current model state.
Combines offline labeling (cheap) with online weight updates (adaptive) to balance efficiency and performance.
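
To make the console's role concrete, here is a minimal sketch of score aggregation and a collaboration-weight update. The function names, the weighted-sum aggregation, and the exponentiated-gradient update rule are all assumptions made for illustration; the summary does not state the exact formulas the paper uses.

```python
import math

def aggregate_scores(actor_scores, collab_weights):
    """Weighted sum of per-actor scores for each document.

    actor_scores: {actor_name: [score per document]}
    collab_weights: {actor_name: weight}
    Both structures are hypothetical, chosen for this sketch.
    """
    n_docs = len(next(iter(actor_scores.values())))
    return [
        sum(collab_weights[name] * scores[i]
            for name, scores in actor_scores.items())
        for i in range(n_docs)
    ]

def update_collab_weights(collab_weights, actor_rewards, lr=1.0):
    """Assumed update rule: scale each weight by exp(lr * reward),
    then renormalize so the weights sum to 1."""
    raw = {name: w * math.exp(lr * actor_rewards[name])
           for name, w in collab_weights.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}
```

With this rule, an actor whose scores earned a higher reward gains influence in the next aggregation, while the other actors keep a non-zero share.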

Architecture

The multi-actor collaborative framework. It shows the offline labeling phase and the online loop where the Actor Console aggregates scores from Quality, Domain, and Topic actors, gets feedback from the model, and updates actor weights.

Evaluation Highlights

Achieves up to 10.5% relative performance gain on average across benchmarks compared to state-of-the-art baselines (including MATES, DoReMi, QuRating).
Significantly improves data efficiency: a 30B-token run outperforms a 60B-token random-sampling baseline.
Reduces computational cost: 1/7th the FLOPs of QuRating and half the FLOPs of MATES.

Breakthrough Assessment

7/10

The method effectively resolves the conflict between different data selection heuristics and scales well. The performance gains (up to 10.5% relative improvement in average accuracy) and efficiency improvements over strong baselines like MATES make it a significant contribution to pretraining science.

⚙️ Technical Details

Pipeline Flow

Offline Labeling: Label entire corpus with Quality, Domain, and Topic metadata.
Initialization: Initialize actor weights using regression on small proxy models.
Training Loop:
1. Actors sample data and score it based on internal weights.
2. Current Model computes reward (influence function on reference tasks) for sampled data.
3. Actors update internal weights (intra-actor learning).
4. Actor Console updates collaboration weights (inter-actor learning) based on actor rewards.
5. Select top-k data for next training stage based on aggregated scores.
6. Train main model on selected data.
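
The six steps of the training loop can be sketched as a single stage function. This is only an illustration of the control flow: the actor interface (`score`, `update`), `reward_fn`, `train_fn`, and the dot-product credit rule in step 4 are hypothetical stand-ins, not the paper's implementation.

```python
import random

def run_stage(corpus, actors, collab_weights, reward_fn, train_fn,
              sample_size, top_k, rng):
    """Run one selection stage. All interfaces here are hypothetical.

    actors: {name: object with .score(doc) and .update(docs, rewards)}
    reward_fn: doc -> float, standing in for the influence-based reward
    train_fn: list of docs -> None, standing in for one training stage
    """
    # 1. Sample candidate documents and let every actor score them.
    sample = rng.sample(corpus, sample_size)
    scores = {name: [actor.score(doc) for doc in sample]
              for name, actor in actors.items()}

    # 2. The current model assigns a reward to each sampled document.
    rewards = [reward_fn(doc) for doc in sample]

    # 3. Intra-actor learning: each actor updates its own weights.
    for actor in actors.values():
        actor.update(sample, rewards)

    # 4. Inter-actor learning (assumed rule): credit each actor by the
    #    dot product of its scores with the rewards, then normalize.
    credit = {name: max(0.0, sum(s * r for s, r in zip(vals, rewards)))
              for name, vals in scores.items()}
    total = sum(credit.values())
    if total > 0:
        collab_weights = {name: c / total for name, c in credit.items()}

    # 5. Select the top-k documents by aggregated score.
    def combined(doc):
        return sum(collab_weights[name] * actor.score(doc)
                   for name, actor in actors.items())
    selected = sorted(corpus, key=combined, reverse=True)[:top_k]

    # 6. Train the main model on the selected data.
    train_fn(selected)
    return collab_weights, selected
```

Calling `run_stage` repeatedly, feeding the returned weights back in, gives the staged loop described above. The offline labeling and the proxy-model initialization happen before the first call and are not shown.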

System Modules

Quality Actor

Prioritizes data based on quality scores (FineWeb-Edu).

Model or implementation: BERT-based regressor (FineWeb-Edu scorer)

Domain Actor

Prioritizes data based on source domain.

Model or implementation: Metadata lookup
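
As an illustration of how a metadata-lookup actor might hold and adapt its internal weights, the sketch below keeps one weight per domain and nudges it toward the mean reward of sampled documents from that domain. The class name, the `"domain"` field, the domain labels, and the additive update rule are assumptions for this example only.

```python
from collections import defaultdict

class DomainActor:
    """Toy domain actor: one weight per source domain (hypothetical design)."""

    def __init__(self, domains):
        # Start from a uniform distribution over domains.
        self.weights = {d: 1.0 / len(domains) for d in domains}

    def score(self, doc):
        # Metadata lookup: a document's score is its domain's weight.
        return self.weights[doc["domain"]]

    def update(self, docs, rewards, lr=0.5):
        # Assumed intra-actor rule: add lr * mean reward per domain,
        # clip at a small positive floor, then renormalize to sum to 1.
        sums, counts = defaultdict(float), defaultdict(int)
        for doc, reward in zip(docs, rewards):
            sums[doc["domain"]] += reward
            counts[doc["domain"]] += 1
        for domain, count in counts.items():
            mean_reward = sums[domain] / count
            self.weights[domain] = max(1e-8, self.weights[domain] + lr * mean_reward)
        total = sum(self.weights.values())
        self.weights = {d: w / total for d, w in self.weights.items()}
```

Domains that are absent from a sample keep their relative weight and are only rescaled by the renormalization.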

Topic Actor

Prioritizes data based on semantic topics.

Model or implementation: BERT-based topic classifier (trained on GPT-4o annotated clusters)

Actor Console

Aggregates scores and manages collaboration.

Model or implementation: N/A (Optimization logic)

📊 Experiments & Results

Evaluation Setup

Pretrain 1.3B-parameter models on 30B tokens selected from SlimPajama. Compare against baselines on downstream tasks.

Benchmarks:

MMLU, ARC, MathQA (Problem Solving)
SIQA, WinoGrande, OBQA, CSQA (Commonsense Reasoning)
RACE, BoolQ (Reading Comprehension)

Metrics:

Average Accuracy (0-shot, 3-shot, 5-shot)

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (points)
Average (10 tasks)	Accuracy	34.2	37.8	+3.6
Average (10 tasks)	Accuracy	35.6	37.8	+2.2
Average (10 tasks)	Accuracy	35.3	37.8	+2.5
Average (10 tasks)	Accuracy	36.1	37.8	+1.7
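
The Δ column reports absolute accuracy points, whereas the highlights quote a relative gain. The short calculation below converts each row to a relative gain. The table does not name the baseline behind each row, so the rows are identified only by their baseline value; the helper name is hypothetical.

```python
def relative_gain(baseline, ours):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (ours - baseline) / baseline

# (baseline accuracy, this paper's accuracy), copied from the table rows.
rows = [(34.2, 37.8), (35.6, 37.8), (35.3, 37.8), (36.1, 37.8)]

for baseline, ours in rows:
    print(f"baseline {baseline}: +{ours - baseline:.1f} points, "
          f"{relative_gain(baseline, ours):.1f}% relative")
```

The first row works out to about 10.5% relative, which appears to be the source of the "up to 10.5%" figure; the remaining rows give roughly 6.2%, 7.1%, and 4.7%.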

Experiment Figures

3D bar chart showing inherent conflicts in data attributes (Quality vs Domain vs Topic Diversity vs Influence), motivating the need for multi-actor collaboration.

Downstream performance vs pretraining steps. The proposed method consistently outperforms baselines throughout the training process.

Main Takeaways

Dynamic collaboration between Quality, Domain, and Topic actors yields better data selection than any single method alone.
Online updates based on influence functions allow the data distribution to shift according to the model's learning needs (e.g., shifting domain weights).
The method is computationally efficient by separating offline labeling from lightweight online weight updates.