Data Mixing: The process of determining the optimal proportion of different data groups (e.g., sources or topics) in the pre-training corpus to maximize model performance
SlimPajama: A large-scale, deduplicated, open-source dataset for LLM pre-training, cleaned from RedPajama
DoReMi: Domain Reweighting with Minimax Optimization—an algorithm that trains a small proxy model to find data weights that minimize worst-case loss
RegMix: Regression-based Mixing—an approach that trains small models on random mixtures, fits a regression model to predict performance, and optimizes weights
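As a rough illustration of the RegMix recipe above, the sketch below samples random mixtures, fits a regression from mixture weights to proxy-run losses, and searches candidates for the lowest predicted loss. The linear synthetic loss, the constants, and plain least squares are illustrative stand-ins, not the method's actual setup (RegMix also uses gradient-boosted regressors and real proxy training runs).

```python
import numpy as np

rng = np.random.default_rng(0)
n_domains, n_runs = 3, 32

# 1) Sample random mixture weights (each row sums to 1), one per proxy run.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_runs)

# 2) Stand-in for proxy-model training: a synthetic linear loss response
#    plus noise replaces real validation losses from small-model runs.
true_coef = np.array([1.0, 0.5, 2.0])  # hypothetical per-domain loss contributions
losses = mixtures @ true_coef + rng.normal(0.0, 0.01, n_runs)

# 3) Fit a regression model predicting loss from mixture weights.
coef, *_ = np.linalg.lstsq(mixtures, losses, rcond=None)

# 4) Optimize: among many candidate mixtures, keep the lowest predicted loss.
candidates = rng.dirichlet(np.ones(n_domains), size=10_000)
best = candidates[np.argmin(candidates @ coef)]
```

With these synthetic coefficients the search concentrates weight on the domain with the smallest fitted loss contribution (index 1), which is the qualitative behavior RegMix relies on.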
PerfRe: Performance-based Reweighting—a heuristic method proposed in this paper in which data groups are upsampled according to their empirical benefit to downstream tasks
NPMI: Normalized Pointwise Mutual Information—a measure used here to quantify the correlation (or lack thereof) between data sources and semantic topics
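For reference, NPMI rescales pointwise mutual information into [-1, 1], where 0 means the source and topic are statistically independent. A minimal sketch with hypothetical joint and marginal probabilities:

```python
import math

def npmi(p_xy, p_x, p_y):
    """Normalized pointwise mutual information of a (source, topic) pair.

    PMI = log(p(x, y) / (p(x) * p(y))); NPMI = PMI / (-log p(x, y)).
    Ranges from -1 (never co-occur) through 0 (independent) to 1
    (the pair always co-occurs).
    """
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)

# Independent source/topic: p(x, y) = p(x) * p(y), so NPMI is (numerically) 0.
independence_score = npmi(0.06, 0.2, 0.3)
```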
Llama tokens: Tokens produced by the tokenizer of the Llama model family; commonly used as the unit for reporting corpus and training-budget sizes
RoPE: Rotary Position Embeddings—a method for encoding positional information in transformer models
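A minimal NumPy sketch of the idea behind RoPE, using the common "rotate-half" channel layout (an assumption; the exact channel pairing varies across implementations): each channel pair is rotated by an angle that grows with position, so the dot product of two rotated vectors depends only on their relative position.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Channel i in the first half is paired with channel i in the second
    half, and each pair is rotated by angle position * base**(-2i/dim).
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / dim)   # per-pair rotation frequencies
    angles = np.outer(positions, freqs)              # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because each pair is rotated by a position-proportional angle, a query at position m and a key at position n interact through the relative offset n - m only, which is what makes RoPE attractive for attention.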
Group DRO: Group Distributionally Robust Optimization—an optimization technique used in DoReMi to minimize the loss of the worst-performing group
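A one-step sketch of the exponentiated-gradient weight update at the heart of Group DRO, with hypothetical names; DoReMi additionally measures each domain's excess loss against a reference model and mixes in a uniform smoothing term, both of which this sketch omits.

```python
import numpy as np

def group_dro_weights(group_losses, alpha, eta=1.0):
    """One exponentiated-gradient step on the Group DRO domain weights.

    Domains with higher current loss are upweighted multiplicatively, so
    subsequent training focuses on the worst-performing group; the weights
    are renormalized to stay on the simplex.
    """
    alpha = alpha * np.exp(eta * group_losses)
    return alpha / alpha.sum()

# Hypothetical two-domain example: the higher-loss domain gains weight.
weights = group_dro_weights(np.array([2.0, 0.5]), np.array([0.5, 0.5]))
```

Note that equal losses leave the weights unchanged after renormalization, so the update only shifts mass when some group is genuinely lagging.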