FineWebEDU: A high-quality English dataset filtered for educational value and information density
mC4: A massive multilingual dataset from Common Crawl, generally considered lower quality than curated datasets like FineWeb
SBERT: Sentence-BERT—a modification of the BERT network that uses siamese network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity
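A minimal sketch of the comparison step, with toy 3-dimensional vectors standing in for real SBERT embeddings (actual SBERT vectors come from the sentence-transformers library and have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — illustrative values only, not real model output
emb_cat = [0.9, 0.1, 0.0]
emb_kitten = [0.8, 0.2, 0.1]
emb_car = [0.0, 0.1, 0.9]

print(cosine_similarity(emb_cat, emb_kitten))  # close to 1: similar meanings
print(cosine_similarity(emb_cat, emb_car))     # near 0: unrelated
```

Because the embeddings are compared with a cheap vector operation rather than a full cross-encoder pass, SBERT makes large-scale semantic search and clustering tractable.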
Chinchilla scaling: Compute-optimal training rules for allocating a fixed compute budget between model size and training data, suggesting roughly 20 training tokens per model parameter
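The rule of thumb reduces to one multiplication; the exact ratio depends on the assumptions of the scaling-law fit, but 20 tokens per parameter is the commonly cited figure:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # Chinchilla heuristic: training tokens ≈ 20 × model parameters
    return n_params * tokens_per_param

# A 7B-parameter model would want roughly 140B training tokens
print(chinchilla_optimal_tokens(7e9))  # 1.4e11
```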
TransWebEDU: A version of the FineWebEDU dataset machine-translated into other languages (e.g., French) to create a high-quality parallel corpus
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects spanning STEM, the humanities, the social sciences, and other areas
PolyLM tokenizer: A tokenizer designed for multilingual models to provide more balanced vocabulary coverage across languages
curse of multilinguality: The phenomenon where adding more languages to a model of fixed capacity degrades performance on individual languages
DCLM classifier: A fastText classifier trained to distinguish high-quality data (like OpenHermes) from lower-quality web data
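A hedged sketch of the input format such a classifier consumes: fastText's supervised mode expects one example per line, prefixed with `__label__<class>`. The labels and example texts below are illustrative, not the actual DCLM training set:

```python
# Prepare training lines in fastText's supervised format,
# "__label__<class> <text>". Labels "hq"/"lq" are hypothetical names.
def to_fasttext_line(label, text):
    # fastText expects single-line examples; collapse internal whitespace
    return f"__label__{label} {' '.join(text.split())}"

examples = [
    ("hq", "A clear explanation of how photosynthesis converts light into chemical energy."),
    ("lq", "click here best deals buy now limited offer!!!"),
]

lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
# With the fasttext package installed, a classifier could then be
# trained on such a file via fasttext.train_supervised(input=...)
# and used to score web documents for quality filtering.
```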