Craw4LLM: Efficient Web Crawling for LLM Pretraining

📝 Paper Summary

Data Curation for LLMs · Web Crawling · Pretraining Data Filtering

Craw4LLM improves pretraining data collection by prioritizing URLs based on their predicted influence on LLM quality rather than traditional graph connectivity metrics like PageRank.

Core Problem

Traditional web crawlers prioritize pages with high connectivity (e.g., high indegree), which correlates poorly with the data quality needed for LLM pretraining, leading to over 90% of crawled data being discarded.

Why it matters:

Inefficiency leads to massive waste of computational resources processing useless data
Over-crawling burdens website operators with redundant traffic, raising ethical and legal concerns
Current methods require crawling 4-5x the necessary data to find high-quality documents

Concrete Example: A standard crawler might prioritize a link farm or directory page because it has many incoming links (high indegree), even though its text is low-quality spam. Craw4LLM would ignore it if the classifier predicts low educational value, instead following a link to a high-quality but less-connected blog post.
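The contrast in this example can be sketched with a toy frontier. The URLs, indegree counts, and quality scores below are invented purely for illustration; they do not come from the paper.

```python
# Toy frontier: every URL and number here is a made-up illustration.
frontier = {
    "linkfarm.example/directory": {"indegree": 950, "quality": 0.05},
    "smallblog.example/tutorial": {"indegree": 12, "quality": 0.91},
    "news.example/home": {"indegree": 400, "quality": 0.40},
}


def rank(frontier, key):
    """Return URLs sorted from the highest to the lowest value of `key`."""
    return sorted(frontier, key=lambda url: frontier[url][key], reverse=True)


by_indegree = rank(frontier, "indegree")  # connectivity-first ordering
by_quality = rank(frontier, "quality")    # quality-first ordering

print("indegree picks first:", by_indegree[0])
print("quality picks first: ", by_quality[0])
```

Under these assumed numbers the indegree ordering visits the link farm first, while the quality ordering starts with the less-connected blog post.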

Key Novelty

LLM-Influence-Based Crawling Priority

Replace standard graph-based priority signals (like PageRank or indegree) with a quality score derived from pretraining data filters (e.g., DCLM fastText or FineWeb-Edu classifiers)
Score unvisited URLs based on the quality of their parent pages or direct scoring, prioritizing the exploration of the web graph towards high-quality 'neighborhoods' useful for LLMs

Architecture

Conceptual workflow of Craw4LLM vs Traditional Crawlers

Evaluation Highlights

Matches the downstream performance of traditional crawls while crawling only 21% of the total URLs
Achieves >95% of the theoretical oracle performance (selecting from the full web graph) while crawling just 1x the target dataset size
Outperforms traditional crawlers that collect 2x or 4x the data followed by filtering, validating that quality-first crawling is more efficient than crawl-then-select

Breakthrough Assessment

7/10

Simple but highly effective shift in crawling paradigm. Directly addresses the massive inefficiency of current data pipelines. Simulation-based validation is a limitation, but the efficiency gains are substantial.

⚙️ Technical Details

Problem Definition

Setting: Web crawling on a known web graph snapshot (ClueWeb22-A) to collect a fixed budget of N documents for pretraining

Inputs: Seed URLs and a web graph structure

Outputs: A set of N crawled documents P used for pretraining

Pipeline Flow

Seed Initialization
Score Calculation (using Influence Scorer)
Priority Queue Management
Fetching & Expansion
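
The four steps above can be sketched as a simulated crawl loop. This is a minimal illustration, not the authors' implementation: it assumes a static snapshot in which every URL's score can be looked up directly, and the names `graph`, `scores`, `budget`, and `per_iter` are hypothetical stand-ins (the last two play the roles of N and n).

```python
import heapq


def crawl(graph, scores, seeds, budget, per_iter):
    """Quality-first crawl over a known graph snapshot (toy sketch).

    graph:  dict mapping URL -> list of outlinked URLs
    scores: dict mapping URL -> quality score from some classifier
    """
    seen = set(seeds)
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    queue = [(-scores[u], u) for u in seeds]
    heapq.heapify(queue)
    crawled = []
    while queue and len(crawled) < budget:
        # Pop up to per_iter top-scored URLs without exceeding the budget.
        take = min(per_iter, budget - len(crawled), len(queue))
        batch = [heapq.heappop(queue)[1] for _ in range(take)]
        for url in batch:
            crawled.append(url)
            # Expansion: enqueue outlinks that have not been seen before.
            for link in graph.get(url, []):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(queue, (-scores[link], link))
    return crawled


# Made-up toy graph and scores.
graph = {"seed": ["good", "spam"], "good": ["better"], "spam": ["junk1", "junk2"]}
scores = {"seed": 0.5, "good": 0.8, "spam": 0.1,
          "better": 0.9, "junk1": 0.05, "junk2": 0.02}
result = crawl(graph, scores, ["seed"], budget=3, per_iter=1)
print(result)
```

With a budget of three pages, the loop in this toy setup follows the high-scoring path and never fetches the low-scoring branch.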

System Modules

Pretraining Influence Scorer

Assign a priority score to unvisited URLs based on the predicted quality of the content

Model or implementation: DCLM fastText classifier or FineWeb-Edu classifier

Scheduler / Priority Queue

Order URLs based on scores to determine crawl order

Model or implementation: Standard Priority Queue

Novel Architectural Elements

Replacement of graph-connectivity heuristics (PageRank/indegree) with content-quality heuristics (LLM pretraining influence scores) directly in the crawler scheduler

Modeling

Base Model: 411M-parameter decoder-only Transformer (DCLM baseline architecture)

Training Data:

Subset of ClueWeb22-A (English)
Total crawled documents N = 20M
Crawled documents per iteration n = 10K
Seed URLs = 10K randomly sampled

Key Hyperparameters:

tokens: 32.9B (4x Chinchilla-optimal)
context_length: 2048
global_batch_size: 256
learning_rate: 3.0e-3
warmup_steps: 1596
decay_steps: 14368
optimizer: AdamW (beta1=0.9, beta2=0.95, weight_decay=0.1)
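
As a quick sanity check on the token budget, the arithmetic below assumes the commonly cited rule of thumb of roughly 20 training tokens per parameter for Chinchilla-optimal training. That ratio is an assumption added here; the summary states only the final 32.9B figure.

```python
params = 411e6           # model size from the summary
tokens_per_param = 20    # assumed Chinchilla rule of thumb, not from the summary
multiplier = 4           # "4x Chinchilla-optimal"

tokens = params * tokens_per_param * multiplier
print(f"token budget: {tokens / 1e9:.1f}B")

# Tokens consumed per optimizer step, from the listed batch size and context length.
tokens_per_step = 256 * 2048
print(f"tokens per step: {tokens_per_step:,}")
```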

Compute: 1 day and 12 hours of training on 8x NVIDIA L40S GPUs

Comparison to Prior Work

vs. Common Crawl: Prioritizes content quality over graph connectivity
vs. Crawl-then-select: Integrates selection *into* the crawling loop, reducing the volume of data that needs to be fetched and processed

Limitations

Simulated environment: Experiments run on a ClueWeb22 snapshot, not the live web, so they avoid real-world issues such as DDoS protection or changing content.
Simplified scoring: The simulation assumes static scores; in real crawling, URL scoring often relies on parent page quality or lightweight estimation.
Scope: Does not implement politeness policies, parallelization, or re-visit policies typical of industrial crawlers like Nutch.

Reproducibility

Code: https://github.com/cxcscmu/Craw4LLM

The code is publicly available at https://github.com/cxcscmu/Craw4LLM. Experiments are simulations on ClueWeb22-A (which requires a license from CMU). The setup uses the standard DCLM training recipe and evaluation metrics.

📊 Experiments & Results

Evaluation Setup

Pretrain 411M-parameter models on 20M documents collected via different crawling strategies and evaluate them on downstream tasks.

Benchmarks:

DCLM Evaluation Suite: Core tasks (53 tasks aggregated into 23 categories)

Metrics:

Average accuracy across 23 core tasks (MMLU, HellaSwag, ARC, etc.)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Craw4LLM significantly outperforms traditional crawling strategies when constrained to the same data budget (1x = 20M docs).
Core Tasks Average	Accuracy	0.3808	0.4136	+0.0328
Core Tasks Average	Accuracy	0.3541	0.4136	+0.0595
Even when baselines are allowed to crawl 2x the data and select the best 50%, Craw4LLM (fetching only 1x) still outperforms them.
Core Tasks Average	Accuracy	0.3875	0.4136	+0.0261
Core Tasks Average	Accuracy	0.4339	0.4136	-0.0203
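
The Δ column can be re-derived from the listed accuracies. The short check below recomputes each difference and also the ratio of Craw4LLM's score to each baseline. The rows are unlabeled here, so reading the 0.4339 row as the oracle upper bound is only an inference: its ratio of about 0.953 lines up with the ">95% of oracle" highlight.

```python
craw4llm = 0.4136
baselines = [0.3808, 0.3541, 0.3875, 0.4339]  # in table order; row labels are not given

# Recompute each delta and the ratio of Craw4LLM to the baseline.
deltas = [craw4llm - b for b in baselines]
ratios = [craw4llm / b for b in baselines]

for b, d, r in zip(baselines, deltas, ratios):
    print(f"baseline {b:.4f}: delta {d:+.4f}, ratio {r:.3f}")
```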

Experiment Figures

Crawling efficiency curves: Downstream LLM accuracy vs. number of documents crawled/visited

Precision/Recall of the crawler finding 'Oracle' documents over time

Main Takeaways

Graph connectivity (indegree) is a poor proxy for LLM data quality; widely connected pages are often not the most educational or informative.
Craw4LLM achieves the performance of a 4.8x larger traditional crawl while visiting only ~21% of the pages.
High-quality documents tend to link to other high-quality documents (score correlation across hops), validating the strategy of following high-scoring paths.
Precision of fetched documents quickly reaches 1.0 relative to Oracle selection, meaning the crawler effectively stays within the 'high-quality' subgraph.
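
To make the precision and recall language concrete, here is a minimal sketch of how a crawl could be compared against an oracle set. It assumes the oracle is simply the top-k documents by score over the whole graph; the function names and toy scores are hypothetical.

```python
def oracle_top_k(scores, k):
    """Return the k highest-scoring URLs across the entire graph."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def precision_recall(crawled, oracle):
    """Precision and recall of the crawled set relative to the oracle set."""
    crawled, oracle = set(crawled), set(oracle)
    hits = len(crawled & oracle)
    precision = hits / len(crawled) if crawled else 0.0
    recall = hits / len(oracle) if oracle else 0.0
    return precision, recall


# Made-up scores for five documents.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.2, "e": 0.1}
oracle = oracle_top_k(scores, k=3)

early = precision_recall(["a", "b"], oracle)       # every fetched doc is in the oracle set
noisy = precision_recall(["a", "b", "d"], oracle)  # one fetched doc falls outside it
print(early, noisy)
```

A precision of 1.0 in this sketch means every fetched document belongs to the oracle set, while recall still grows as more of that set is covered.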

📚 Prerequisite Knowledge

Prerequisites

Understanding of web crawling basics (seeds, frontiers, priority queues)
Familiarity with LLM pretraining data pipelines (Common Crawl, filtering)
Basic graph theory (indegree, PageRank)

Key Terms

DCLM: DataComp for Language Models, a benchmark and dataset for exploring how data curation affects LLM performance

FineWeb-Edu: A dataset and classifier designed to identify high-quality educational content on the web

indegree: The number of incoming links to a webpage; often used as a proxy for page importance in traditional crawling

oracle selection: An idealized selection method that can see the entire web graph at once and pick the absolute best pages, serving as a performance upper bound

fastText: A library for efficient text classification and representation learning, used here to score document quality

frontier: The set of unvisited URLs that the crawler is aware of and plans to visit

ClueWeb22: A large-scale web dataset serving as a snapshot of the English web, used here to simulate crawling without actual network traffic