Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies

📝 Paper Summary

Scaling Laws, Data Quality, Training Efficiency

Traditional scaling laws fail in high-density data regimes and over-training scenarios; a new sub-optimal scaling law incorporating data density and allocation ratios better predicts performance deceleration.

Core Problem

Traditional scaling laws predict power-law improvements, but recent large models exhibit 'sub-scaling' where performance gains decelerate significantly when trained on redundant data or with non-optimal compute allocation.

Why it matters:

Blindly increasing model size or data volume yields diminishing returns, wasting massive computational resources.
Current laws fail to account for data redundancy (density) and over-training (training small models on very large token counts), leading to inaccurate performance forecasts.
Understanding these limits is crucial for efficiently training Large Language Models (LLMs) such as LLaMA-3, which deviate from Chinchilla optimality.

Concrete Example: When LLaMA-3-8B is trained on 15T tokens (heavy over-training, roughly 1,875 tokens per parameter), performance gains slow relative to LLaMA-2's trajectory despite improved training strategies. Traditional laws predict a lower loss than is actually observed because they ignore the redundancy in such vast datasets.

Key Novelty

Sub-Optimal Scaling Law (SOSL)

Introduces a 'density' metric that penalizes data redundancy: high-density clusters (repetitive concepts) contribute less information gain than diverse, low-density data.
Generalizes the Chinchilla scaling law by adding decay factors based on data density and the Over-Training Ratio (OTR), mathematically modeling the diminishing returns observed in practice.

Architecture

Conceptual visualization of data density. Figure 5 shows semantic clusters (circles) where high density = many similar samples (redundancy). Figure 6 plots sample count vs. cluster ID.

Evaluation Highlights

Proposed Sub-Optimal Scaling Law reduces prediction error (MAPE) from 0.0245 (traditional law) to 0.0016 on 500B token training runs.
Identifies a critical Over-Training Ratio (OTR) threshold of 50; beyond this point the scaling exponent stabilizes and additional data yields diminishing returns on loss reduction.
Demonstrates that low-density data subsets (selected via the proposed metric) sustain linear-like performance growth longer than high-density raw data.

Breakthrough Assessment

8/10

Provides a mathematically grounded correction to widely used scaling laws, specifically addressing the modern regime of 'over-training' small models on massive data.

⚙️ Technical Details

Problem Definition

Setting: Modeling the relationship between autoregressive language model loss L, model size N, dataset size D, and compute budget C.

Inputs: Model parameters N, training tokens D, dataset density ρ

Outputs: Predicted Cross-Entropy Loss L(N, D)

Pipeline Flow

Dataset Analysis (Compute Density)
Model Training (Various Sizes & OTRs)
Scaling Law Fitting (Traditional vs. Sub-Optimal)

System Modules

Density Calculator

Computes dataset density ρ considering intra-cluster concentration and inter-cluster separation.

Model or implementation: Mathematical Formula (Eq. 3)

Language Models

Train autoregressive models to generate loss curves.

Model or implementation: Transformer Decoder (20M to 7B parameters)

Scaling Law Fitter

Fits parametric models to observed loss data to evaluate predictive accuracy.

Model or implementation: Regression (Least Squares)
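The fitting step can be illustrated with a minimal least-squares sketch. It fits a single power-law term, y = lam * x^(-alpha), as a straight line in log-log space. This is only an assumption about how such a fit might be set up: the summary does not describe the paper's actual fitting procedure, the function name fit_power_law is hypothetical, and the data below is synthetic and noise-free, generated from placeholder coefficients.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = lam * x**(-alpha) as a line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log lam - alpha * log x, so lam = exp(intercept), alpha = -slope
    return math.exp(intercept), -slope

# Synthetic data (placeholder coefficients, not values from the paper).
tokens = [1e9, 3e9, 1e10, 3e10, 1e11]
true_lam, true_alpha, irreducible = 410.0, 0.28, 1.7
losses = [irreducible + true_lam / d ** true_alpha for d in tokens]

# Assumes the irreducible loss E is known and subtracted before fitting.
lam, alpha = fit_power_law(tokens, [loss - irreducible for loss in losses])
print(f"lam={lam:.2f}, alpha={alpha:.4f}")
```

Fitting the full law, with E and both power-law terms free, is a nonlinear problem and would need an iterative optimizer rather than this closed-form line fit.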

Novel Architectural Elements

Sub-Optimal Scaling Law formulation: $L(N, D) = E + \frac{\lambda_N R_N}{N^{\alpha_N}} + \frac{\lambda_D R_D}{D^{\alpha_D}}$, where $R_N$ and $R_D$ are logistic decay factors that depend on OTR and density.
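The following sketch shows how the formula might be evaluated. The summary states only that R_N and R_D are logistic factors depending on OTR and density, so everything beyond the formula itself is an assumption for illustration: the coefficient values in COEFFS, the helper hypothetical_r_d, its logistic shape in log10(OTR), and its direction. R_D is made to grow above 1 here only to mirror the summary's statement that traditional laws underestimate loss in over-trained regimes.

```python
import math

def logistic(x, midpoint, steepness, low, high):
    """Generic logistic curve rising from `low` to `high` around `midpoint`."""
    return low + (high - low) / (1.0 + math.exp(-steepness * (x - midpoint)))

def sosl_loss(n, d, e, lam_n, alpha_n, lam_d, alpha_d, r_n=1.0, r_d=1.0):
    """L(N, D) = E + lam_N * R_N / N^alpha_N + lam_D * R_D / D^alpha_D.

    With r_n = r_d = 1 this reduces to the Chinchilla-style form.
    """
    return e + lam_n * r_n / n ** alpha_n + lam_d * r_d / d ** alpha_d

# Placeholder coefficients, NOT the paper's fitted values.
COEFFS = dict(e=1.7, lam_n=400.0, alpha_n=0.34, lam_d=410.0, alpha_d=0.28)

def hypothetical_r_d(otr, density):
    """Assumed form: logistic in log10(OTR), centred at OTR = 50,
    saturating at 1 + density. Purely illustrative."""
    return logistic(math.log10(otr), midpoint=math.log10(50.0),
                    steepness=2.0, low=1.0, high=1.0 + density)

n, d = 1e9, 1e11  # OTR = 100
baseline = sosl_loss(n, d, **COEFFS)
adjusted = sosl_loss(n, d, r_d=hypothetical_r_d(d / n, density=0.64), **COEFFS)
print(f"traditional form: {baseline:.4f}, with R_D factor: {adjusted:.4f}")
```

Keeping R_N and R_D as plain arguments means the same function covers both the traditional fit and any specific logistic form substituted later.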

Modeling

Base Model: Transformer Decoder (sizes: 20M, 47M, 113M, ..., 7B)

Training Method: Standard Autoregressive Pre-training

Training Data:

The Pile (Density 0.64)
Deduplicated Pile (Density 0.56)
Density-Based Pile (Density 0.47, selected subset)

Key Hyperparameters:

learning_rate: 2e-4 (small models) to 1.25e-4 (large models), cosine decay
optimizer: AdamW (beta1=0.9, beta2=0.95)
batch_size: Dynamic (scaled with loss/model size)
sequence_length: Not explicitly reported in the paper
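As a rough illustration of the cosine decay listed above, here is a minimal schedule sketch. The summary gives only the peak learning rates and the words "cosine decay"; the absence of warmup, the final learning rate (min_lr), and the step count are assumptions, and cosine_lr is a hypothetical helper name.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Peak of 2e-4 is the small-model value from the list above;
# total_steps = 10_000 is an arbitrary placeholder.
for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, f"{cosine_lr(step, 10_000, 2e-4):.3e}")
```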

Compute: Experiments cover over 400 models ranging from 20M to 7B parameters. Exact GPU hours not reported.

Comparison to Prior Work

vs. Chinchilla: Explicitly models non-optimal (over-training) regimes where D >> N, adding decay terms.
vs. Muennighoff et al.: Uses a novel density metric (intra+inter cluster) rather than just repetition count to quantify data quality.
vs. Gadre et al. (2024) [cited]: Goes beyond finding optimal hyperparameters to mathematically formulating the sub-scaling law itself.

Limitations

Experiments limited to models up to 7B parameters; behavior at 100B+ scale inferred but not fully tested.
Focuses primarily on over-training (high D/N), less exploration of under-training.
Density metric relies on embedding quality and clustering, which can be computationally expensive for massive datasets.

Reproducibility

Code: https://github.com/AnonymousCode222/SOSL

Code and data released at https://github.com/AnonymousCode222/SOSL. Dataset density metrics and exact model architectures (layers, hidden size) provided in tables. Hyperparameter schedules follow Chinchilla recommendations.

📊 Experiments & Results

Evaluation Setup

Pre-training language models on datasets of varying densities and measuring loss/performance scaling.

Benchmarks:

The Pile (Standard, Deduplicated, Density-Selected) (Language Modeling)
MMLU (Multi-task Language Understanding)

Metrics:

Cross-Entropy Loss (Validation)
MMLU Accuracy
MAPE (Mean Absolute Percentage Error) of scaling law fit
Statistical methodology: Shapiro-Wilk Test to verify distribution of scaling exponents.

Key Results

Fitting accuracy comparison: the proposed Sub-Optimal Scaling Law significantly outperforms the traditional scaling law in predicting loss for large training runs.

Benchmark	Metric	Baseline	This Paper	Δ
The Pile (500B tokens)	MAPE (Prediction)	0.02451	0.00156	-0.02295
The Pile (30B tokens)	MAPE (Prediction)	0.01151	0.00249	-0.00902
MMLU (7B Model)	Fitting Accuracy (Visual)	High error (visual)	Low error (visual)	Not reported in the paper

Experiment Figures

Loss vs Model Size curves showing the sub-scaling phenomenon. The actual loss (points) diverges upwards from the extrapolated straight line (dashed) as training scales.

Loss vs Model Size for fixed compute, comparing Traditional vs Sub-Optimal Scaling Law fits.

Main Takeaways

High data density (redundancy) causes performance to plateau earlier; low-density data sustains linear-like gains longer.
Over-Training Ratio (OTR) > 50 triggers a regime where the scaling exponent stabilizes, meaning adding more data yields significantly diminishing returns.
Traditional scaling laws underestimate loss in over-trained regimes because they assume optimal resource allocation.
The optimal batch size follows a power-law relationship with loss, and this relationship holds even in sub-scaling regimes.

📚 Prerequisite Knowledge

Prerequisites

Scaling Laws (Chinchilla/Kaplan)
Information Theory (Entropy/Redundancy)
Language Model Training Dynamics

Key Terms

Sub-scaling: A phenomenon where performance improvements decelerate faster than predicted by traditional power laws, often due to data redundancy or non-optimal resource allocation.

Data Density: A metric quantifying redundancy; high density means samples are clustered closely (repetitive), contributing less new information.

OTR: Over-Training Ratio—the ratio of training tokens D to model parameters N (D/N). High OTR indicates training a relatively small model on a massive amount of data.
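The OTR definition can be made concrete with a short sketch. The threshold of 50 is the value reported in this summary; the function names are hypothetical, and the Chinchilla figures (70B parameters, 1.4T tokens) come from general knowledge of that model rather than from this summary.

```python
OTR_THRESHOLD = 50.0  # critical threshold reported in this summary

def over_training_ratio(tokens, params):
    """OTR = D / N: training tokens per model parameter."""
    return tokens / params

def is_over_trained(tokens, params, threshold=OTR_THRESHOLD):
    """True when the run lies beyond the reported OTR threshold."""
    return over_training_ratio(tokens, params) > threshold

# LLaMA-3-8B on 15T tokens, the concrete example from this summary.
print(over_training_ratio(15e12, 8e9))  # 1875.0
print(is_over_trained(15e12, 8e9))      # True

# Chinchilla (70B parameters, 1.4T tokens) for comparison.
print(over_training_ratio(1.4e12, 70e9))  # 20.0
print(is_over_trained(1.4e12, 70e9))      # False
```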

Chinchilla Law: A scaling law proposing that, for compute-optimal training, model size and training tokens should be scaled in equal proportion as compute grows (roughly 20 tokens per parameter).

MAPE: Mean Absolute Percentage Error—a measure of prediction accuracy used to evaluate how well the scaling laws fit the actual loss curves.
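A minimal sketch of the MAPE computation follows. It returns a fraction rather than a percentage, which appears to match the scale of the values in this summary (for example 0.0245), though the summary does not state the convention explicitly. The loss values in the example are made up for illustration.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.05 = 5%)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Made-up observed and predicted losses, not data from the paper.
observed = [2.0, 2.5, 4.0]
predicted = [2.1, 2.5, 3.8]
print(f"MAPE = {mape(observed, predicted):.4f}")
```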

The Pile: A large-scale, diverse text dataset commonly used for training LLMs, composed of 22 component datasets spanning different domains.

Common Crawl: A massive dataset of web crawl data, often containing high redundancy.

FLOPs: Floating Point Operations, a measure of compute budget. For Transformers, training compute is usually approximated as C ≈ 6 * N * D, where N is the parameter count and D is the number of training tokens.
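The 6 * N * D approximation can be worked through in a short sketch. The compute_optimal_split helper is hypothetical and assumes the common rule of thumb of about 20 tokens per parameter for Chinchilla-style allocation; that ratio is an assumption here, not a value taken from this summary.

```python
import math

def training_flops(params, tokens):
    """Approximate training compute: C = 6 * N * D."""
    return 6.0 * params * tokens

def compute_optimal_split(flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for (N, D)."""
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params

# LLaMA-3-8B on 15T tokens, the concrete example from this summary.
budget = training_flops(8e9, 15e12)
n_opt, d_opt = compute_optimal_split(budget)
print(f"compute: {budget:.2e} FLOPs")
print(f"20-tokens-per-parameter split: N = {n_opt:.2e}, D = {d_opt:.2e}")
```

Under that assumed ratio, the same compute budget would correspond to a much larger model trained on far fewer tokens, which is the gap the OTR is meant to quantify.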