Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

📝 Paper Summary

Data Curation for LLM Pre-training
Synthetic Data Generation

ReWire recycles discarded low-quality web documents by using an LLM to rewrite them into high-quality training data via chain-of-thought reasoning, effectively doubling the usable data pool.

Core Problem

High-quality natural text data is scarce, and standard filtering pipelines discard up to 99% of web crawls, creating a 'data wall' that limits model scaling.

Why it matters:

The stock of high-quality public human text is projected to be exhausted between 2026 and 2032
Discarding 99% of data is inefficient when compute resources continue to grow
Current synthetic data methods often focus on specific formats (Q&A) rather than general pre-training corpora

Concrete Example: A web document might contain useful facts but be poorly written, incoherent, or unstructured (e.g., a messy product listing). Standard filters discard it. ReWire identifies the core purpose and rewrites it into a coherent document, making it usable for training.

Key Novelty

Recycling the Web with Guided Rewrite (ReWire)

Instead of discarding low-quality documents, use a strong LLM (Llama-3.3-70B) to reason about their content and rewrite them into high-quality text
Treats web scraps as 'initial drafts' rather than final training data, using the LLM to improve structure and coherence while retaining information
Demonstrates that mixing this 'recycled' synthetic data with real high-quality data works better than using real data alone or simply repeating real data

Architecture

The overall data generation pipeline for ReWire

Evaluation Highlights

+2.5 percentage-point improvement in average CORE accuracy (22 tasks) at 7B scale when mixing recycled data with raw text vs. raw text alone
Matches or outperforms training on 2x as much raw data, effectively doubling the token yield
82% of the useful synthetic data comes from documents that would have been discarded by standard quality filters

Breakthrough Assessment

8/10

Offers a scalable solution to the impending 'data wall' by proving that 'trash' data can be recycled into high-quality training signal, effectively acting as a data multiplier.

⚙️ Technical Details

Problem Definition

Setting: LLM Pre-training with a fixed, limited budget of raw web data

Inputs: A large pool of moderate-quality web documents (DCLM-RefinedWeb) normally subjected to aggressive filtering

Outputs: A pre-trained Language Model

Pipeline Flow

Data Selection: Start with moderate-quality web pool (DCLM-RefinedWeb)
Guided Rewriting: Prompt Llama-3.3-70B-Instruct to reason and rewrite documents
Quality Filtering: Train fastText classifier to filter synthetic outputs
Training: Mix high-quality raw texts with high-quality rewritten texts
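
The four steps above can be sketched as a small Python function. This is a minimal illustration, not the paper's implementation: `build_mix`, `top_fraction`, `toy_rewrite`, and `toy_score` are hypothetical names, the toy functions stand in for the Llama-3.3-70B-Instruct rewriter and the fastText classifiers, and rewriting every document in the pool (rather than only the discarded ones) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    text: str
    score: float  # quality score; higher means better
    source: str   # "raw" or "rewritten"


def top_fraction(docs: List[Doc], fraction: float) -> List[Doc]:
    """Keep the highest-scoring `fraction` of docs (at least one if any exist)."""
    if not docs:
        return []
    k = max(1, int(len(docs) * fraction))
    return sorted(docs, key=lambda d: d.score, reverse=True)[:k]


def build_mix(
    pool: List[str],
    rewrite: Callable[[str], str],
    score_raw: Callable[[str], float],
    score_rewritten: Callable[[str], float],
    fraction: float = 0.10,
) -> List[Doc]:
    """Return the top raw docs followed by the top rewritten docs."""
    raw = [Doc(t, score_raw(t), "raw") for t in pool]
    # Assumption: every pool document is rewritten, including ones
    # that the raw-text filter would have dropped.
    rewritten = []
    for t in pool:
        new_text = rewrite(t)
        rewritten.append(Doc(new_text, score_rewritten(new_text), "rewritten"))
    return top_fraction(raw, fraction) + top_fraction(rewritten, fraction)


# Toy stand-ins so the sketch runs without a model or a classifier.
def toy_rewrite(text: str) -> str:
    return " ".join(text.split()).capitalize() + "."


def toy_score(text: str) -> float:
    return len(set(text.lower().split())) / 10.0


pool = [f"doc {i}   " + "word " * i for i in range(10)]
mix = build_mix(pool, toy_rewrite, toy_score, toy_score, fraction=0.10)
print([d.source for d in mix])  # ['raw', 'rewritten']
```

In a real run, `rewrite` would wrap an LLM call and the two scorers would be trained quality classifiers; only the select, rewrite, filter, mix ordering is taken from the summary.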

System Modules

Guided Rewriter

Transform low/moderate quality web text into high-quality training examples

Model or implementation: Llama-3.3-70B-Instruct

Synthetic Filter

Select only the highest quality rewritten texts

Model or implementation: fastText classifier

Modeling

Base Model: Llama-2 architecture (1B, 3B, 7B parameters)

Training Method: Standard Pre-training (Next Token Prediction)

Objective Functions:

Purpose: Train the model to predict the next token.

Formally: Standard cross-entropy loss.
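
For reference, the standard next-token cross-entropy objective can be written as below. The notation is introduced here rather than taken from the paper: θ denotes the model parameters, x_t the t-th token, and T the sequence length.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```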

Training Data:

Baselines: Top 10% or 20% of DCLM-RefinedWeb (ranked by fastText)
ReWire: Mix of Raw text (top 10%) + Rewritten text (top 10%)

Key Hyperparameters:

tokens_seen_1B: 28.8B (1x Chinchilla) to 144B (5x)
tokens_seen_3B: 55.9B (1x)
tokens_seen_7B: 138B (1x) to 276B (2x)
context_length: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
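
The token counts above follow from scaling each model's 1x budget by the Chinchilla multiplier. A short sketch of that arithmetic, using only the 1x figures listed above (the function and constant names are illustrative):

```python
# 1x Chinchilla token budgets, in billions of tokens, as listed above.
ONE_X_TOKENS_B = {"1B": 28.8, "3B": 55.9, "7B": 138.0}


def token_budget_b(scale: str, multiplier: float) -> float:
    """Training-token budget in billions for a given Chinchilla multiplier."""
    return round(ONE_X_TOKENS_B[scale] * multiplier, 1)


print(token_budget_b("1B", 5))  # 144.0
print(token_budget_b("7B", 2))  # 276.0
```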

Compute: Not reported in the paper

Comparison to Prior Work

vs. Nemotron-CC: ReWire focuses on rewriting the *entire* document based on its purpose rather than just extracting knowledge or formatting as QA; ReWire recycles discarded data rather than just augmenting high-quality data
vs. PreSelect: ReWire generates *new* tokens via rewriting rather than just selecting existing ones
vs. Simple Rephrasing: ReWire uses Chain-of-Thought to add structure and coherence, resulting in higher diversity and semantic change than simple paraphrasing

Limitations

Dependency on a very large, high-quality teacher model (Llama-3.3-70B) for rewriting, which is computationally expensive
Rewriting process may hallucinate or alter factual content, though semantic similarity analysis suggests it largely stays on topic
Experiments limited to relatively small models (up to 7B) compared to state-of-the-art frontier models
Does not explicitly address decontamination of test sets from the synthetic data

Reproducibility

Data: https://huggingface.co/datasets/facebook/recycling_the_web

High-quality synthetic data is publicly available on HuggingFace. The exact rewriting prompt is given in Appendix B of the paper. Code for the Lingua framework (used for training) is cited, but the specific experiment scripts are not linked.

📊 Experiments & Results

Evaluation Setup

Pre-training language models from scratch on different data mixtures and evaluating on standard benchmarks

Benchmarks:

DCLM Benchmark: suite of 22 language tasks (e.g., HellaSwag, ARC)
MMLU: 5-shot multiple-choice question answering

Metrics:

CORE (Centered Accuracy)
MMLU 5-shot accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Main results: performance gains of ReWire across model scales (1B, 3B, 7B) compared to high-quality raw data baselines.
Benchmark	Metric	Baseline	This Paper	Δ
DCLM (Average CORE)	CORE	0.289	0.299	+0.010
DCLM (Average CORE)	CORE	0.362	0.375	+0.013
DCLM (Average CORE)	CORE	0.420	0.445	+0.025
MMLU	Accuracy	0.326	0.447	+0.121

Comparison with 2x data scale: ReWire matches or beats training on double the amount of raw data.
Benchmark	Metric	Baseline	This Paper	Δ
DCLM (Average CORE)	CORE	0.425	0.445	+0.020

Comparison with other synthetic data methods: ReWire scores higher than each alternative.
Benchmark	Metric	Baseline	This Paper	Δ
DCLM (Average CORE)	CORE	0.364	0.375	+0.011
DCLM (Average CORE)	CORE	0.368	0.375	+0.007
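
As a quick sanity check, the Δ column can be recomputed from the Baseline and This Paper columns. The sketch below copies the seven rows as they appear in this summary; the helper names are illustrative.

```python
# (metric, baseline, this_paper, reported_delta), copied from the rows above.
ROWS = [
    ("CORE", 0.289, 0.299, 0.010),
    ("CORE", 0.362, 0.375, 0.013),
    ("CORE", 0.420, 0.445, 0.025),
    ("MMLU", 0.326, 0.447, 0.121),
    ("CORE", 0.425, 0.445, 0.020),
    ("CORE", 0.364, 0.375, 0.011),
    ("CORE", 0.368, 0.375, 0.007),
]


def delta_points(baseline: float, ours: float) -> float:
    """Absolute gain expressed in percentage points."""
    return round((ours - baseline) * 100, 1)


def deltas_consistent(rows, tol: float = 1e-9) -> bool:
    """True if every reported delta equals this_paper - baseline."""
    return all(abs((ours - base) - delta) < tol for _, base, ours, delta in rows)


print(deltas_consistent(ROWS))     # True
print(delta_points(0.420, 0.445))  # 2.5
```

The last line reproduces the +2.5 percentage-point gain quoted for the 7B scale.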

Experiment Figures

Scatter plot of Raw Text Quality Scores vs. Rewritten Text Quality Scores for 10K documents

Diversity analysis (Bigram count) of different data sources
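
One simple way to compute a bigram-count diversity measure is to count distinct adjacent token pairs across a corpus. The sketch below assumes lowercased whitespace tokenization; the paper's exact tokenization and normalization are not described in this summary.

```python
from typing import Iterable, Set, Tuple


def distinct_bigrams(texts: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect the set of distinct adjacent token pairs across all texts."""
    seen: Set[Tuple[str, str]] = set()
    for text in texts:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        seen.update(zip(tokens, tokens[1:]))
    return seen


corpus = ["the cat sat", "the cat ran"]
print(len(distinct_bigrams(corpus)))  # 3
```

A larger count for the same number of tokens indicates more varied phrasing in the data source.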

Main Takeaways

Mixing high-quality raw text with guided rewritten text consistently outperforms using raw text alone across 1B, 3B, and 7B scales.
The method is effectively a data multiplier: training with ReWire on a fixed data pool matches the performance of having access to 2x the amount of raw data.
ReWire outperforms other synthetic data approaches like Wikipedia rephrasing, QA synthesis, and knowledge extraction on general benchmarks.
There is little correlation between the quality of the raw input and the quality of the rewritten output, suggesting that 'trash' data can indeed be recycled.
The rewritten data is largely distinct from the raw data (only 18.3% overlap in selected documents), indicating that most of the gains come from recycling documents that would otherwise be discarded.
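
An overlap figure like the one above can be computed by comparing which document IDs pass the raw-text filter with which pass the rewritten-text filter. This is a hedged sketch: the function names are hypothetical, and defining overlap as the share of selected rewritten documents whose raw source was also selected is an assumption about how the paper measures it.

```python
from typing import Dict, Set


def top_ids(scores: Dict[str, float], fraction: float) -> Set[str]:
    """IDs of the highest-scoring `fraction` of documents (at least one)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def selection_overlap(
    raw_scores: Dict[str, float],
    rewritten_scores: Dict[str, float],
    fraction: float,
) -> float:
    """Share of selected rewritten docs whose raw source was also selected."""
    raw_sel = top_ids(raw_scores, fraction)
    rew_sel = top_ids(rewritten_scores, fraction)
    return len(raw_sel & rew_sel) / len(rew_sel)


raw = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}
rewritten = {"a": 0.7, "b": 0.1, "c": 0.95, "d": 0.3}
print(selection_overlap(raw, rewritten, fraction=0.5))  # 0.5
```

In the toy example, document "c" fails the raw filter but passes after rewriting, which is the recycling effect the takeaway describes.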

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM pre-training scaling laws (Chinchilla)
Familiarity with data filtering pipelines (e.g., fastText classifiers)
Knowledge of synthetic data generation techniques

Key Terms

DCLM: DataComp-LM—a benchmark for evaluating dataset curation strategies for language models

fastText: A library for efficient text classification, used here to score and filter documents based on quality

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before producing the final answer

Chinchilla multiplier: The training-token budget expressed as a multiple (1x, 5x, etc.) of the Chinchilla compute-optimal amount for a given model size; 1x implies roughly 20 tokens per parameter

CORE metric: Centered Accuracy—a metric from DCLM that normalizes task performance so 0 is random guessing and 1 is perfect accuracy
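
The normalization described above can be written as a linear rescaling of raw accuracy against the random-guessing baseline. A minimal sketch, assuming the per-task random baseline is known (for example, 0.25 for four-way multiple choice); the function name is illustrative:

```python
def centered_accuracy(accuracy: float, random_baseline: float) -> float:
    """Rescale accuracy so random guessing maps to 0 and perfect accuracy to 1."""
    if not 0.0 <= random_baseline < 1.0:
        raise ValueError("random_baseline must be in [0, 1)")
    return (accuracy - random_baseline) / (1.0 - random_baseline)


print(centered_accuracy(0.25, 0.25))   # 0.0
print(centered_accuracy(1.0, 0.25))    # 1.0
print(centered_accuracy(0.625, 0.25))  # 0.5
```

The CORE score reported in the tables is then an average of such centered values over the 22 tasks.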

MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects across STEM, the humanities, and social sciences

SimCSE: Simple Contrastive Learning of Sentence Embeddings—a method for training sentence embeddings, used here to measure semantic similarity
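
Semantic similarity between a raw document and its rewrite is typically read off as the cosine similarity of their sentence embeddings. The sketch below shows only that final step on plain lists of floats; using cosine similarity is an assumption, and producing the embeddings with a SimCSE model is out of scope here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```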

t-SNE: t-Distributed Stochastic Neighbor Embedding—a technique for visualizing high-dimensional data in 2D or 3D