Evaluation Setup
Pretrain 411M-parameter models on 20M documents collected via different crawling strategies, then evaluate on downstream tasks.
Benchmarks:
- DCLM Evaluation Suite, core tasks (53 tasks aggregated into 23 categories)
Metrics:
- Average accuracy across the 23 core task categories (MMLU, HellaSwag, ARC, etc.)
- Statistical methodology: Not explicitly reported in the paper
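The headline metric is a macro-average over per-task accuracies, which can be sketched as below (the task names and scores are illustrative placeholders, not the paper's reported per-task numbers):

```python
# Macro-average accuracy over task categories, as in the "Core Tasks Average"
# metric. Scores here are illustrative placeholders, not the paper's values.
task_accuracies = {
    "mmlu": 0.30,
    "hellaswag": 0.45,
    "arc_easy": 0.50,
}

def core_average(scores: dict[str, float]) -> float:
    """Unweighted mean over task categories (each category counts equally)."""
    return sum(scores.values()) / len(scores)

print(round(core_average(task_accuracies), 4))  # -> 0.4167
```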
Key Results
Craw4LLM significantly outperforms traditional crawling strategies when constrained to the same data budget (1x = 20M docs):

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Core Tasks Average | Accuracy | 0.3808 | 0.4136 | +0.0328 |
| Core Tasks Average | Accuracy | 0.3541 | 0.4136 | +0.0595 |

Even when baselines are allowed to crawl 2x the data and select the best 50%, Craw4LLM (fetching only 1x) remains competitive, outperforming one baseline and narrowly trailing the other:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Core Tasks Average | Accuracy | 0.3875 | 0.4136 | +0.0261 |
| Core Tasks Average | Accuracy | 0.4339 | 0.4136 | -0.0203 |
Main Takeaways
- Graph connectivity (indegree) is a poor proxy for LLM data quality; widely connected pages are often not the most educational or informative.
- Craw4LLM achieves the performance of a 4.8x larger traditional crawl while visiting only ~21% of the pages.
- High-quality documents tend to link to other high-quality documents (score correlation across hops), validating the strategy of following high-scoring paths.
- Precision of fetched documents quickly reaches 1.0 relative to Oracle selection, meaning the crawler effectively stays within the 'high-quality' subgraph.
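The takeaways above describe a quality-first crawling policy: score newly discovered pages with a pretraining-value model and always expand the highest-scoring frontier page, rather than the most-linked one. A minimal sketch under stated assumptions (the toy link graph and `quality_score` lookup stand in for the real web and the paper's learned scorer):

```python
import heapq

# Toy link graph and quality scores -- stand-ins (assumptions) for the real
# web graph and a learned classifier estimating pretraining value.
LINKS = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
SCORES = {"seed": 0.9, "a": 0.8, "b": 0.2, "c": 0.7, "d": 0.1}

def quality_score(url: str) -> float:
    """Stand-in for a model scoring a page's value as LLM pretraining data."""
    return SCORES[url]

def crawl(seed: str, budget: int) -> list[str]:
    """Fetch up to `budget` pages, always expanding the best-scoring frontier page."""
    # heapq is a min-heap, so negate scores to pop the highest-quality page first.
    frontier = [(-quality_score(seed), seed)]
    seen = {seed}
    fetched = []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        fetched.append(url)
        for nxt in LINKS.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-quality_score(nxt), nxt))
    return fetched

print(crawl("seed", 3))  # -> ['seed', 'a', 'c']
```

Note how the crawler reaches `c` (score 0.7) only by following the high-scoring page `a`, while low-scoring `b` is never expanded; this is the "high-quality pages link to high-quality pages" effect the takeaways describe.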