Unsupervised Topic Models are Data Mixers for Pre-training Language Models

📝 Paper Summary

LLM Pre-training, Data Curation, Data Mixing

Organizing pre-training data by semantic topics rather than data sources consistently improves LLM performance across multiple mixing algorithms by creating a better optimization landscape.

Core Problem

Current data mixing strategies rely on coarse 'sources' (e.g., CommonCrawl, GitHub). Each source contains heterogeneous topics, so source-level weights cannot capture the semantic distinctions needed for optimal training.

Why it matters:

Single sources like CommonCrawl contain diverse topics (Science, Politics) with varying relevance to downstream tasks, making source-level mixing inefficient.
Modern web-crawled datasets (FineWeb, DCLM) often lack meaningful source divisions, rendering source-based mixing obsolete.
Existing semantic sorting methods either require heavy human supervision (WebOrganizer) or, in the case of unsupervised clustering, scale poorly and produce unlabeled groups.

Concrete Example: A 'Science' topic appears across arXiv (high quality) and CommonCrawl (noisy), while CommonCrawl also contains 'Entertainment'. Source-based mixing treats all CommonCrawl data identically, preventing the model from specifically upweighting high-value scientific content regardless of where it comes from.

Key Novelty

Topic-based Data Mixing via Scalable Taxonomy

Replace source labels (e.g., 'Wikipedia') with semantic topic labels (e.g., 'Science', 'Law') generated via a scalable pipeline of clustering, LLM summarization, and supervised classification.
Apply standard data mixing algorithms (DoReMi, RegMix) to these semantic distributions instead of source distributions to compute optimal pre-training weights.

Architecture

Multi-stage Topic Extraction Pipeline

Evaluation Highlights

+1.90 accuracy gain on Reading Comprehension tasks using Temperature-Topic mixing compared to source-based mixing (1.3B model).
PerfRe-Topic (topic-based reweighting) achieves the highest average score of 45.23, outperforming both source-based PerfRe (44.63) and more advanced methods such as DoReMi-Topic (45.00).
Scaling to 3.3B parameters increases the advantage of topic-based mixing over source-based mixing from 0.5 to 0.7 average points.

Breakthrough Assessment

7/10

Provides the first comprehensive empirical evidence that semantic partitioning is superior to source partitioning for data mixing. While the mixing algorithms themselves are standard, the pipeline and findings on semantic organization are significant for future data curation.

⚙️ Technical Details

Problem Definition

Setting: LLM Pre-training Data Mixing

Inputs: Heterogeneous pre-training corpus (SlimPajama)

Outputs: Optimal data mixture weights p over groups (topics or sources) to minimize validation loss
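
The objective above can be written as a generic bilevel problem. This is a sketch in standard notation rather than the paper's own formulation: K is the number of groups (12 when grouping by topic), D_k is the data in group k, Δ^{K-1} is the probability simplex, and ℓ is the per-example language-modeling loss. All of these symbols are assumptions introduced here for illustration.

```latex
p^{\star} = \arg\min_{p \in \Delta^{K-1}} \; \mathcal{L}_{\mathrm{val}}\bigl(\theta^{\star}(p)\bigr),
\qquad
\theta^{\star}(p) = \arg\min_{\theta} \; \sum_{k=1}^{K} p_k \, \mathbb{E}_{x \sim D_k}\bigl[\ell(\theta; x)\bigr]
```

Solving the inner problem exactly requires a full pre-training run per candidate p, which is why mixing algorithms such as DoReMi and RegMix rely on small proxy models to approximate it.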

Pipeline Flow

Embedding & Clustering: Embed docs → K-Means (Level 1) → K-Means (Level 2)
Label Generation: LLM summarizes clusters → LLM merges summaries into taxonomy
Classifier Training: Annotate samples with LLM → Train BERT classifier
Data Mixing: Re-partition dataset by topic → Run mixing algos (PerfRe, DoReMi, etc.) → Pre-train LLM
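
The two-level clustering step can be illustrated with a minimal sketch. It assumes documents are already embedded as fixed-length vectors and uses a plain Lloyd's k-means written with the standard library only. The function names (`kmeans`, `two_level_kmeans`), the toy 2-D points, and the cluster counts are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # requires k <= len(points)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def two_level_kmeans(points, k1, k2):
    """Cluster into k1 coarse groups, then split each group into up to k2 sub-clusters."""
    level1 = kmeans(points, k1)
    groups = defaultdict(list)
    for idx, c in enumerate(level1):
        groups[c].append(idx)
    labels = [None] * len(points)
    for c, idxs in groups.items():
        sub_k = min(k2, len(idxs))
        sub = kmeans([points[i] for i in idxs], sub_k, seed=c + 1)
        for i, s in zip(idxs, sub):
            labels[i] = (c, s)  # (level-1 cluster, level-2 cluster)
    return labels


# Toy "embeddings": four Gaussian blobs in 2-D standing in for document vectors.
rng = random.Random(42)
docs = [
    (rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
    for cx, cy in [(0, 0), (0, 5), (5, 0), (5, 5)]
    for _ in range(10)
]
labels = two_level_kmeans(docs, k1=2, k2=2)
```

In the described pipeline, each resulting sub-cluster would then be summarized by an LLM and the summaries merged into a taxonomy. At real scale, a library implementation of K-Means would replace this toy loop.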

System Modules

Embedding Model (Topic Extraction)

Convert documents into dense vectors for clustering

Model or implementation: BGE model

Cluster Summarizer (Topic Extraction)

Generate human-readable labels for clusters

Model or implementation: gpt-4o-2024-11-20

Topic Classifier (Topic Extraction)

Scale topic assignment to the full 600M-document dataset

Model or implementation: BERT (fine-tuned)

Mixing Algorithm

Calculate sampling weights for each group

Model or implementation: Various (PerfRe, DoReMi, RegMix, Temperature)
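
As an illustration of the simplest listed algorithm, the sketch below computes temperature-based sampling weights in their common form, where each group's weight is proportional to its size raised to the power 1/T. This summary does not give the paper's exact formulation, so the formula, the function name `temperature_weights`, and the group sizes are assumptions.

```python
def temperature_weights(sizes, temperature):
    """Return sampling weights proportional to size ** (1 / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {group: n ** (1.0 / temperature) for group, n in sizes.items()}
    total = sum(scaled.values())
    return {group: value / total for group, value in scaled.items()}


# Hypothetical group sizes (e.g., token counts in arbitrary units).
sizes = {"Technology": 400, "Science": 100}

proportional = temperature_weights(sizes, temperature=1.0)  # matches raw proportions
flattened = temperature_weights(sizes, temperature=2.0)     # upweights the smaller group
```

With T = 1 the weights equal the raw proportions; larger T pushes the mixture toward uniform, which upsamples small groups.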

Novel Architectural Elements

Topic-based partitioning pipeline: Replaces static source metadata with dynamic, classifier-driven semantic labels for the entire pre-training corpus prior to mixing

Modeling

Base Model: 1.3B and 3.3B parameter decoder-only transformers (Llama architecture)

Training Method: Pre-training from scratch

Training Data:

Subset of 30B tokens (1.3B models) or 70B tokens (3.3B models) from SlimPajama
Partitioned into 12 topics: Technology, Science, Politics, Health, Lifestyle, Law, Entertainment, Education, Relationships, Finance, Community, Others
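
Once mixture weights are fixed, they translate into a per-topic token budget. The sketch below splits the 30B-token budget across the 12 topics using largest-remainder rounding. The uniform weights are a placeholder, not the weights found in the paper, and `allocate_tokens` is a hypothetical helper.

```python
TOPICS = [
    "Technology", "Science", "Politics", "Health", "Lifestyle", "Law",
    "Entertainment", "Education", "Relationships", "Finance", "Community", "Others",
]


def allocate_tokens(weights, budget):
    """Split an integer token budget across groups in proportion to their weights."""
    total = sum(weights.values())
    exact = {g: budget * w / total for g, w in weights.items()}
    alloc = {g: int(x) for g, x in exact.items()}  # round down first
    leftover = budget - sum(alloc.values())
    # Hand the remaining tokens to the groups with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda g: exact[g] - alloc[g], reverse=True)
    for g in by_fraction[:leftover]:
        alloc[g] += 1
    return alloc


# Placeholder: uniform weights over the 12 topics (an assumption for illustration).
uniform = {topic: 1.0 for topic in TOPICS}
budget_1p3b = allocate_tokens(uniform, 30_000_000_000)
```

Replacing `uniform` with the output of a mixing algorithm yields the number of tokens to sample from each topic partition.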

Key Hyperparameters:

context_window: 1024
position_embeddings: RoPE

Compute: Not reported in the paper

Comparison to Prior Work

vs. DoReMi/RegMix (Standard): Applies these algorithms to *topic* distributions rather than *source* distributions.
vs. WebOrganizer: Fully automated taxonomy generation vs. human-defined taxonomy.
vs. R&B: Produces interpretable, labeled topics via LLM summarization vs. unlabeled clusters.
vs. FineWeb/DCLM [not cited in paper]: Adds semantic structure to monolithic web crawls where source metadata is useless.

Limitations

Topic taxonomy is fixed to 12 broad categories; finer granularity might yield different results.
Relies on a BERT classifier (84% accuracy), so some data is likely misclassified.
Experiments limited to 1.3B and 3.3B parameter models; scaling to >7B not tested.
Comparison is on a fixed 30B/70B token budget; convergence behavior at trillion-token scale unknown.

Reproducibility

Code: https://github.com/huggingface/archived-slimpajama-topic

The authors state that code, annotated datasets, and topic classification models will be made publicly available. The SlimPajama dataset is open. Annotation uses gpt-4o, which is a closed-source dependency.

📊 Experiments & Results

Evaluation Setup

Pre-training followed by zero-shot/few-shot evaluation on downstream tasks

Benchmarks:

General Knowledge (ARC-Challenge, ARC-Easy, SciQ)
Commonsense Reasoning (PIQA, SIQA, WinoGrande, CommonsenseQA)
Reading Comprehension (RACE, OpenBookQA)

Metrics:

Average Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison across mixing strategies (1.3B models): topic-based mixing consistently outperforms source-based mixing.
Benchmark	Metric	Baseline (source-based)	This Paper (topic-based)	Δ
Average across all tasks	Accuracy	44.63	45.23	+0.60
Average across all tasks	Accuracy	43.81	44.67	+0.86
Average across all tasks	Accuracy	43.89	44.39	+0.50
Average across all tasks	Accuracy	44.31	45.00	+0.69
Reading Comprehension	Accuracy	25.76	27.66	+1.90
Scaling experiments (3.3B models): the gap between topic-based and source-based methods widens.
Benchmark	Metric	Baseline (source-based)	This Paper (topic-based)	Δ
Average across all tasks	Accuracy	49.29	50.06	+0.77
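
The Δ column can be recomputed directly from the two score columns, which is a quick consistency check on the table. The row labels below are shorthand added here; the scores are copied from the rows above.

```python
# (label, source-based score, topic-based score), copied from the results table.
rows = [
    ("Average (1.3B)", 44.63, 45.23),
    ("Average (1.3B)", 43.81, 44.67),
    ("Average (1.3B)", 43.89, 44.39),
    ("Average (1.3B)", 44.31, 45.00),
    ("Reading Comprehension (1.3B)", 25.76, 27.66),
    ("Average (3.3B)", 49.29, 50.06),
]

# Gain of topic-based over source-based mixing, in accuracy points.
deltas = [round(topic - source, 2) for _, source, topic in rows]

# Mean gain over the four 1.3B average-accuracy rows.
mean_gain_1p3b = sum(deltas[:4]) / 4

for (label, _, _), delta in zip(rows, deltas):
    print(f"{label}: {delta:+.2f}")
```

The recomputed values match the reported Δ column, and the mean gain across the four 1.3B average-accuracy comparisons is roughly 0.66 points.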

Experiment Figures

Distribution of the 12 extracted topics in SlimPajama

NPMI matrix showing correlation between Sources and Topics
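
For reference, an NPMI matrix of this kind can be computed from a source-by-topic table of document counts using the standard definition: PMI divided by the negative log of the joint probability. The counts below are invented for illustration and do not come from the paper, and `npmi_matrix` is a hypothetical helper.

```python
import math


def npmi_matrix(counts):
    """counts[source][topic] -> document count; returns NPMI values in [-1, 1]."""
    total = sum(sum(row.values()) for row in counts.values())
    p_source = {s: sum(row.values()) / total for s, row in counts.items()}
    p_topic = {}
    for row in counts.values():
        for t, n in row.items():
            p_topic[t] = p_topic.get(t, 0.0) + n / total
    result = {}
    for s, row in counts.items():
        result[s] = {}
        for t, n in row.items():
            if n == 0:
                result[s][t] = -1.0  # limiting value when the pair never co-occurs
            elif n == total:
                result[s][t] = 1.0   # degenerate case: a single cell holds all documents
            else:
                p_joint = n / total
                pmi = math.log(p_joint / (p_source[s] * p_topic[t]))
                result[s][t] = pmi / -math.log(p_joint)
    return result


# Invented counts: arXiv is all Science, CommonCrawl is split evenly.
counts = {
    "arXiv": {"Science": 50, "Entertainment": 0},
    "CommonCrawl": {"Science": 25, "Entertainment": 25},
}
npmi = npmi_matrix(counts)
```

Values near 1 mean a source and a topic almost always co-occur, values near 0 mean they are independent, and values near -1 mean they rarely co-occur.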

Main Takeaways

Topic-based data mixing consistently outperforms source-based mixing across all tested algorithms (PerfRe, DoReMi, RegMix, Temperature).
The performance gap between topic-based and source-based mixing increases as model size scales (from 1.3B to 3.3B).
PerfRe (Performance-based Reweighting) proved to be the most effective mixing strategy overall, surpassing more complex methods like DoReMi and RegMix.
Topic-based organization improves the optimization landscape, reaching lower validation loss than source-based approaches.

📚 Prerequisite Knowledge

Prerequisites

LLM Pre-training pipeline
Data Mixing / Reweighting strategies
Clustering algorithms (K-Means)
Text Embedding models

Key Terms

Data Mixing: The process of determining the optimal proportion of different data groups (e.g., sources or topics) in the pre-training corpus to maximize model performance

SlimPajama: A large-scale, deduplicated, open-source dataset for LLM pre-training, cleaned from RedPajama

DoReMi: Domain Reweighting with Minimax Optimization, an algorithm that trains a small proxy model with Group DRO to find domain weights that minimize the worst-case excess loss relative to a reference model

RegMix: Regression-based Mixing—an approach that trains small models on random mixtures, fits a regression model to predict performance, and optimizes weights

PerfRe: Performance-based Reweighting—a heuristic method proposed in this paper where data groups are upsampled based on their empirical benefit to downstream tasks

NPMI: Normalized Pointwise Mutual Information—a measure used here to quantify the correlation (or lack thereof) between data sources and semantic topics

Llama tokens: Tokens generated by the tokenizer used in the Llama family of models

RoPE: Rotary Position Embeddings—a method for encoding positional information in transformer models

Group DRO: Group Distributionally Robust Optimization—an optimization technique used in DoReMi to minimize the loss of the worst-performing group
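
To make the Group DRO idea concrete, the sketch below shows one exponentiated-gradient step on the group weights in the style DoReMi uses: groups where the proxy model has a higher excess loss are upweighted, and the result is smoothed toward uniform. The function name, parameter names, default values, and the excess-loss numbers are illustrative assumptions, and the proxy and reference models themselves are not shown.

```python
import math


def doremi_weight_update(weights, excess_losses, step_size=1.0, smoothing=1e-3):
    """One exponentiated-gradient step on group weights, followed by uniform smoothing."""
    # Clip excess loss at zero so groups the proxy already fits are not upweighted.
    scaled = [
        w * math.exp(step_size * max(excess, 0.0))
        for w, excess in zip(weights, excess_losses)
    ]
    total = sum(scaled)
    k = len(weights)
    # Renormalize, then mix with the uniform distribution for stability.
    return [(1 - smoothing) * s / total + smoothing / k for s in scaled]


# Hypothetical per-group excess losses (proxy loss minus reference loss).
start = [0.5, 0.5]
updated = doremi_weight_update(start, excess_losses=[0.2, 0.0])
```

In the full algorithm this update runs at every proxy training step, and the final mixture is the average of the weights over those steps. With topic-based mixing, the groups are the 12 topics rather than the data sources.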