PRRC: The four novel quality dimensions proposed in this work: Professionalism, Readability, Reasoning, and Cleanliness
Proxy Models: Small-scale language models trained for a few steps to quickly estimate the effectiveness of a data selection strategy before full-scale training
SlimPajama: A widely used, deduplicated, open-source dataset for training Large Language Models
LightGBM: A gradient boosting framework that uses tree-based learning algorithms, used here to regress validation loss against quality weights
RoPE: Rotary Positional Embeddings—a method for encoding position information in transformer models
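The RoPE entry can be made concrete with a minimal, dependency-free sketch: each consecutive pair of embedding dimensions is rotated by an angle proportional to the token's position, so attention dot products depend only on relative offsets. The function name and the default base of 10000 follow the original RoPE paper's convention; the rest is illustrative.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary positional embeddings to one vector at position `pos`.

    Dimensions (2i, 2i+1) are rotated by angle pos * base**(-2i/dim).
    Rotation preserves vector norms, and the dot product of two rotated
    vectors depends only on the difference of their positions.
    """
    dim = len(vec)
    assert dim % 2 == 0, "RoPE rotates dimension pairs, so dim must be even"
    out = [0.0] * dim
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

At position 0 the rotation is the identity, and shifting both query and key positions by the same offset leaves their dot product unchanged; this relative-position property is why RoPE is popular in modern transformers.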
DSIR: Data Selection with Importance Resampling—a method using hashed n-gram features to select data similar to a target distribution
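The DSIR entry can be illustrated with a small sketch of the core idea: hash word n-grams into a fixed number of buckets (the hashing trick), estimate bucket distributions for a target corpus and the raw pool, and score each candidate document by its log importance weight, the sum of bucket counts times the log-ratio of target to raw probabilities. Bucket count, the use of MD5 as the hash, and the smoothing constant are illustrative choices, not details from the DSIR paper.

```python
import hashlib
import math
from collections import Counter

NUM_BUCKETS = 1 << 16  # illustrative; real DSIR uses a fixed hashed feature space

def hashed_ngram_counts(text, n=2):
    """Hash word uni- and bigrams of `text` into NUM_BUCKETS buckets."""
    words = text.lower().split()
    counts = Counter()
    for k in range(1, n + 1):
        for i in range(len(words) - k + 1):
            gram = " ".join(words[i:i + k])
            bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16) % NUM_BUCKETS
            counts[bucket] += 1
    return counts

def bucket_dist(texts):
    """Normalized bucket frequencies over a corpus of texts."""
    total = Counter()
    for t in texts:
        total += hashed_ngram_counts(t)
    z = sum(total.values())
    return {b: c / z for b, c in total.items()}

def log_importance_weight(text, target_dist, raw_dist, eps=1e-8):
    """log w(x) = sum_b count_b(x) * log(p_target(b) / p_raw(b)), smoothed by eps."""
    logw = 0.0
    for b, c in hashed_ngram_counts(text).items():
        logw += c * (math.log(target_dist.get(b, 0.0) + eps)
                     - math.log(raw_dist.get(b, 0.0) + eps))
    return logw
```

Documents are then resampled from the raw pool with probability proportional to their importance weights, pulling the selected subset toward the target distribution.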
Perplexity (PPL): A measurement of how well a probability model predicts a sample; lower perplexity indicates the text is more 'natural' to the model
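The perplexity entry follows directly from its definition: PPL is the exponential of the average negative log-probability the model assigns to each token, so a model that is uniformly unsure over V choices per token has perplexity exactly V. A minimal sketch, taking per-token log-probabilities as given rather than running an actual model:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean log p(token)); lower PPL means the text is more
    'natural' (more predictable) to the model that produced the log-probs."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

For example, if a model assigns every token probability 0.01, the perplexity is 100: the model is as uncertain as if it were choosing uniformly among 100 tokens at each step.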