Common Crawl (CC): A massive, open repository of web crawl data, commonly used as a primary source for LLM pretraining
DSIR: Data Selection via Importance Resampling—a method to select data from a raw corpus that matches the distribution of a high-quality target corpus
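The core of the DSIR idea can be sketched compactly: score each raw document by the likelihood ratio between a target-distribution model and a raw-distribution model, then resample in proportion to those importance weights. The sketch below uses unigram probabilities and weighted sampling; the actual method uses hashed n-gram features, and the corpora, vocabularies, and function names here are hypothetical.

```python
import math
import random

def log_ratio(doc, target_probs, raw_probs, smooth=1e-4):
    """Log importance weight log p_target(x) - log p_raw(x) over unigrams.

    Unseen words fall back to a small smoothing probability. The real DSIR
    uses hashed n-gram features; unigrams keep this illustration minimal.
    """
    return sum(
        math.log(target_probs.get(w, smooth)) - math.log(raw_probs.get(w, smooth))
        for w in doc.split()
    )

def dsir_select(raw_docs, target_probs, raw_probs, k, seed=0):
    """Resample k documents from raw_docs in proportion to importance weights."""
    rng = random.Random(seed)
    weights = [math.exp(log_ratio(d, target_probs, raw_probs)) for d in raw_docs]
    return rng.choices(raw_docs, weights=weights, k=k)

# Hypothetical unigram models: the target favors scientific vocabulary.
target = {"science": 0.5, "data": 0.3}
raw = {"the": 0.5, "science": 0.01}
docs = ["science data science", "the the the"]
print(dsir_select(docs, target, raw, k=5))
```

With these toy probabilities the target-like document receives an importance weight many orders of magnitude larger, so the resampled set is dominated by it.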
KenLM: A library for efficient n-gram language modeling, often used to filter low-quality text based on perplexity
UniMax: A sampling strategy that caps the number of epochs (repetitions) for any data source to ensure fair representation of low-resource domains
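The epoch-capping logic behind UniMax can be illustrated as a greedy budget allocation: spread the token budget uniformly across sources, cap any source at its maximum number of epochs, and redistribute the leftover budget among uncapped sources. This is a simplified sketch of that allocation scheme, with hypothetical sizes and a hypothetical function name, not the paper's exact algorithm.

```python
def unimax_allocation(sizes, budget, max_epochs=4):
    """Allocate a token budget across sources, capping repetition per source.

    sizes: tokens available in each source; a source may contribute at most
    max_epochs * size tokens. Leftover budget from capped sources is spread
    uniformly over the sources still under their cap.
    """
    caps = [max_epochs * n for n in sizes]
    alloc = [0.0] * len(sizes)
    remaining = float(budget)
    active = set(range(len(sizes)))
    while remaining > 1e-9 and active:
        share = remaining / len(active)  # uniform share for this round
        for i in list(active):
            take = min(share, caps[i] - alloc[i])
            alloc[i] += take
            remaining -= take
            if alloc[i] >= caps[i] - 1e-9:
                active.discard(i)  # source hit its epoch cap
    return alloc

# A large (100-token) and a small (10-token) source sharing a 120-token
# budget: the small source is capped at 4 epochs (40 tokens) and the
# surplus flows to the large source.
print(unimax_allocation([100, 10], budget=120, max_epochs=4))
```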
Alpha sampling: A heuristic sampling method where the probability of sampling a dataset is proportional to its size raised to the power of alpha (values of alpha below 1 flatten the distribution, upweighting smaller datasets)
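Alpha sampling reduces to one formula, p_i = n_i^α / Σ_j n_j^α, which a few lines make concrete. The dataset sizes below are hypothetical.

```python
def alpha_sampling_weights(sizes, alpha=0.5):
    """Sampling probabilities proportional to size**alpha.

    alpha=1 recovers size-proportional sampling; alpha=0 gives uniform
    sampling; intermediate values smooth between the two.
    """
    scaled = [n ** alpha for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]

# Three corpora of 1M, 100k, and 10k documents. With alpha=0.5 the
# smallest corpus gets a far larger share than its raw ~0.9% proportion.
print(alpha_sampling_weights([1_000_000, 100_000, 10_000], alpha=0.5))
```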
DoReMi: Domain Reweighting with Minimax Optimization—a method that uses a proxy model to learn optimal data sampling weights
Perplexity: A measure of how well a probability model predicts a sample; in quality filtering, high perplexity under a reference language model often indicates low-quality or unnatural text
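The quantity itself is the exponential of the negative average log-probability per token. The sketch below computes it against a toy unigram model; production filters typically score text with a KenLM n-gram model instead, and the vocabulary here is invented for illustration.

```python
import math

def perplexity(tokens, probs):
    """exp of the negative mean log-probability of tokens under probs.

    probs maps each token to its model probability; lower perplexity
    means the model finds the text more predictable.
    """
    log_prob = sum(math.log(probs[t]) for t in tokens)
    return math.exp(-log_prob / len(tokens))

# Toy unigram model: common words get high probability, gibberish low.
model = {"the": 0.5, "cat": 0.25, "xqzt": 0.01}
print(perplexity(["the", "cat"], model))   # fluent-looking text scores lower
print(perplexity(["xqzt", "xqzt"], model)) # unnatural text scores higher
```

A perplexity filter then simply drops documents whose score exceeds a chosen threshold.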
DeBERTaV3: A transformer-based model used here as a classifier to label data attributes (toxicity, quality, domain) across the corpus