MeCo: Metadata Conditioning then Cooldown—the proposed pre-training curriculum: source metadata (e.g., document URLs) is prepended to training text for most of pre-training, followed by a cooldown on standard data
cooldown: A final training phase using standard data (no metadata) to ensure the model functions without conditioning
conditional inference: Prepending specific metadata (real or fabricated URLs) to the prompt at test time to steer model output
OLMES: A standardized evaluation suite for language models including tasks like MMLU, ARC, and HellaSwag
DCLM: DataComp-LM—a high-quality dataset derived from Common Crawl using fastText filtering
C4: Colossal Clean Crawled Corpus—a standard web-scale pre-training dataset
RefinedWeb: A high-quality web dataset built from Common Crawl via quality filtering and deduplication
fastText: A library for efficient text classification, used here for data quality filtering
perplexity: A metric measuring how well a probability model predicts a sample; lower is better
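The metadata conditioning and conditional inference entries above can be sketched in a few lines. This is a minimal illustration, not the paper's exact format: the `URL:`-style header, the separator, and the example URLs are assumptions for the sake of the sketch.

```python
def format_with_metadata(url: str, text: str) -> str:
    """Prepend a source-URL header to text. Used at training time on
    documents (metadata conditioning) and at test time on prompts
    (conditional inference). Header layout is an illustrative assumption."""
    return f"URL: {url}\n\n{text}"

# Training-time conditioning: prepend the document's real source URL.
train_example = format_with_metadata(
    "en.wikipedia.org/wiki/Light",
    "Light is electromagnetic radiation ...",
)

# Test-time conditional inference: the URL may be real or fabricated,
# chosen to steer the style or quality of the model's continuation.
prompt = format_with_metadata("en.wikipedia.org", "The speed of light is")
print(prompt)
```

After the cooldown phase, the same model should also accept plain, header-free prompts, so the header is an optional steering signal rather than a required input format.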
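Perplexity, as defined above, is the exponential of the average negative log-likelihood per token. A tiny worked example from a list of per-token log-probabilities:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the tokens."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to each of four tokens is
# "as uncertain as a fair coin at every step": perplexity 2.
print(round(perplexity([math.log(0.5)] * 4), 9))  # → 2.0
```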