DCLM: DataComp-LM—a benchmark for evaluating dataset curation strategies for language models
fastText: A library for efficient text classification, used here to score and filter documents based on quality
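The score-and-filter step can be sketched without the fastText library itself; this assumes per-document quality scores (e.g. a classifier's probability of the "high quality" label) have already been computed, and the scores and threshold below are purely illustrative:

```python
# Hypothetical documents paired with precomputed quality scores
# (e.g. a fastText classifier's P(high_quality | doc)).
docs = [
    ("A clear explanation of gradient descent.", 0.91),
    ("click here buy now free free free", 0.07),
    ("Proof of the triangle inequality.", 0.84),
]

THRESHOLD = 0.5  # illustrative cutoff on the classifier score


def filter_by_quality(scored_docs, threshold=THRESHOLD):
    """Keep documents whose quality score clears the threshold."""
    return [text for text, score in scored_docs if score >= threshold]


kept = filter_by_quality(docs)  # keeps the two high-scoring documents
```

In practice the scores would come from a trained fastText supervised classifier and the threshold would be tuned to hit a target retention rate.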
Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before producing the final answer
Chinchilla multiplier: A multiple (denoted 1x, 5x, etc.) of the compute-optimal token budget from the Chinchilla scaling laws; 1x corresponds to roughly 20 training tokens per model parameter, so 5x means five times that budget
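The arithmetic behind the multiplier is simple; this helper (name and defaults are illustrative) converts a parameter count and multiplier into a token budget under the ~20 tokens-per-parameter rule:

```python
def token_budget(n_params: int, multiplier: float, tokens_per_param: int = 20) -> int:
    """Training tokens implied by a Chinchilla multiplier.

    1x corresponds to ~20 tokens per parameter; 5x means five times that.
    """
    return int(n_params * tokens_per_param * multiplier)


# A 7B-parameter model: 1x implies ~140B tokens, 5x implies ~700B.
budget_1x = token_budget(7_000_000_000, 1)
budget_5x = token_budget(7_000_000_000, 5)
```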
CORE metric: The DCLM benchmark's aggregate score—the mean centered accuracy over its suite of core evaluation tasks, where each task's accuracy is rescaled so 0 corresponds to random guessing and 1 to perfect accuracy
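The per-task rescaling can be written as a one-line function (the function name is illustrative); CORE then averages this quantity across tasks:

```python
def centered_accuracy(accuracy: float, chance: float) -> float:
    """Rescale raw accuracy so chance-level maps to 0 and perfect to 1."""
    return (accuracy - chance) / (1.0 - chance)


# On a 4-way multiple-choice task, chance is 0.25:
at_chance = centered_accuracy(0.25, 0.25)   # 0.0
perfect = centered_accuracy(1.00, 0.25)     # 1.0
halfway = centered_accuracy(0.625, 0.25)    # 0.5
```

Centering makes scores comparable across tasks with different numbers of answer options before they are averaged.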
MMLU: Massive Multitask Language Understanding—a benchmark covering 57 subjects across STEM, the humanities, and social sciences
SimCSE: Simple Contrastive Learning of Sentence Embeddings—a method for training sentence embeddings, used here to measure semantic similarity
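Once SimCSE (or any sentence encoder) has produced embedding vectors, semantic similarity is typically the cosine between them; a minimal pure-Python version on toy vectors (the vectors are illustrative, not real SimCSE outputs):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Identical directions score 1.0; orthogonal directions score 0.0.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```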
t-SNE: t-Distributed Stochastic Neighbor Embedding—a technique for visualizing high-dimensional data in 2D or 3D