Evaluation Setup
Pretrain new models (160M and 1.4B parameters) on data selected by the method and compare downstream zero-shot/few-shot performance against baselines.
Benchmarks:
- SciQ (science question answering)
- MMLU (general knowledge; mentioned in the motivation)
- Aggregate of 22 benchmarks (general language understanding, from DataComp-LM)
Metrics:
- Accuracy
- Statistical methodology: pre-registered experiments for the 1.4B-scale runs
Main Takeaways
- Perplexity correlations are a reliable signal for data quality: domains where lower loss across public models correlates with higher benchmark accuracy consistently yield better training data for that benchmark.
- The method scales effectively: improvements over baselines increase or hold steady when moving from 160M to 1.4B parameters.
- Outperforms DSIR (n-gram-based selection) consistently across all 8 tested benchmarks at the 160M scale.
- Achieves parity with state-of-the-art hand-tuned classifiers (DataComp-LM's fastText filter) without requiring any manual feature engineering or human curation.
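The perplexity-correlation idea behind the first takeaway can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: it uses synthetic data, and all names (`log_loss`, `bench_err`) are assumptions. The core move is to rank-correlate each domain's per-model loss with benchmark error across a set of public models, then prefer domains with high correlation when selecting pretraining data.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on centered ranks
    (valid here because the synthetic data has no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

rng = np.random.default_rng(0)
n_models, n_domains = 8, 5

# log_loss[i, j]: average log-loss of public model i on held-out text
# from pretraining domain j (synthetic values for illustration).
log_loss = rng.normal(loc=3.0, scale=0.3, size=(n_models, n_domains))

# bench_err[i]: model i's error on the target benchmark. Here it tracks
# domain 0's loss exactly, so domain 0 should come out on top.
bench_err = log_loss[:, 0].copy()

# Correlate each domain's loss column with benchmark error; a high
# positive correlation marks a domain worth upweighting during selection.
corrs = np.array([spearman(log_loss[:, j], bench_err)
                  for j in range(n_domains)])
ranked = np.argsort(-corrs)  # best-correlated domains first
```

The appeal of this formulation is that it needs only losses from already-trained public models plus their published benchmark scores, so no new training runs are required to score candidate domains.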