QuaDMix: The proposed framework, which parameterizes data sampling probabilities as a function of both quality and domain, so that quality filtering and domain mixing are optimized jointly.
Proxy Model: A small model (e.g., 1M parameters) trained to estimate the performance of larger models, allowing for cheap exploration of hyperparameters.
LightGBM: A gradient boosting framework that uses tree-based learning algorithms, used here as a regressor to predict model loss from sampling parameters.
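The regression step can be sketched as follows: fit a tree-based regressor on (sampling parameters → proxy-model loss) pairs, then search a large pool of candidate parameter vectors for the predicted minimum. This is a minimal illustration with synthetic data; scikit-learn's `GradientBoostingRegressor` stands in for LightGBM, and the 5-dimensional parameter vector and quadratic loss surface are invented for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical setup: each row is a vector of sampling parameters
# (e.g., quality thresholds and domain-mixture weights) tried in a proxy run;
# the target is that proxy model's validation loss (here, a synthetic quadratic).
X = rng.uniform(0.0, 1.0, size=(200, 5))            # 200 proxy runs, 5 parameters
true_optimum = np.array([0.2, 0.8, 0.5, 0.3, 0.6])  # invented for illustration
y = ((X - true_optimum) ** 2).sum(axis=1) + rng.normal(0, 0.01, 200)

# Fit a tree-based regressor (standing in for LightGBM) on parameters -> loss.
reg = GradientBoostingRegressor(random_state=0).fit(X, y)

# Search a large pool of candidate parameter vectors; keep the predicted best.
candidates = rng.uniform(0.0, 1.0, size=(10_000, 5))
best = candidates[np.argmin(reg.predict(candidates))]
print(best)
```

Because the regressor is cheap to query, the candidate pool can be far larger than the number of proxy runs actually trained.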
RefinedWeb: A large-scale English web dataset used as the source corpus for pretraining experiments.
RegMix: A baseline method that optimizes data mixtures (diversity) using regression on proxy model results but does not jointly optimize quality filtering.
AskLLM: A quality filtering method that uses a prompted LLM to score data quality.
SwiGLU: A widely used activation function in LLMs; a variant of GLU (Gated Linear Unit) that uses the Swish function as its gate.
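The standard SwiGLU formulation is SwiGLU(x) = Swish(xW) ⊙ (xV), where Swish(z) = z·sigmoid(z). A minimal NumPy sketch (shapes and weight names are illustrative, not from the paper):

```python
import numpy as np

def swiglu(x, W, V):
    """SwiGLU: swish(x @ W) * (x @ V), with swish(a) = a * sigmoid(a)."""
    a = x @ W
    return (a / (1.0 + np.exp(-a))) * (x @ V)

# Hypothetical shapes: batch of 2 tokens, model dim 4, hidden dim 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W, V = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(swiglu(x, W, V).shape)  # (2, 8)
```

In a transformer feed-forward block, this output is typically projected back to the model dimension by a third weight matrix.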
RoPE: Rotary Positional Embeddings, a method for encoding position information in transformer models.
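RoPE encodes position by rotating consecutive pairs of embedding dimensions through position-dependent angles, so that query–key dot products depend only on relative position. A minimal NumPy sketch of the rotation (the base of 10000 follows the common convention):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, dim), dim even.

    Each consecutive pair of dimensions (2i, 2i+1) at position m is rotated
    by the angle m * base**(-2i / dim).
    """
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)     # (dim/2,) frequencies
    angles = np.outer(np.arange(seq_len), inv_freq)      # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                   # 2x2 rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is transformed by a pure rotation, vector norms are preserved, and position 0 (angle 0) is left unchanged.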