
Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre
DeepMind
arXiv (2022)
Pretraining Benchmark

📝 Paper Summary

Scaling Laws · LLM Pre-training · Compute Optimization
For compute-optimal training, model size and the number of training tokens should be scaled equally, contradicting prior laws that favored scaling model size much faster than data.
Core Problem
Prior scaling laws (Kaplan et al., 2020) suggested that as compute budgets increase, model size should scale much faster than training data, leading to the creation of massive but undertrained models.
Why it matters:
  • Current LLMs (like Gopher, GPT-3, MT-NLG) are significantly larger than necessary for their compute budget, wasting resources during training and inference.
  • Inference costs scale with model size; oversized models make downstream deployment and fine-tuning prohibitively expensive and slow.
  • Accurately estimating hyperparameters is critical because training large models is extremely capital-intensive and typically done only once.
Concrete Example: Gopher (280B parameters) was trained on 300B tokens. The paper finds that for the same compute budget, a 67B parameter model trained on 1.5T tokens would achieve lower loss and better downstream performance.
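The arithmetic behind this example can be sketched with the widely used C ≈ 6ND FLOP estimate and the roughly 20-tokens-per-parameter rule of thumb implied by the paper's fits. The helper below (`compute_optimal` and the fixed token ratio are illustrative assumptions, not the paper's full fitted laws) recovers a model size close to the 67B figure quoted above:

```python
import math

def compute_optimal(C, tokens_per_param=20.0):
    """Split a FLOP budget C compute-optimally, assuming C ~= 6*N*D
    and a fixed ~20 tokens-per-parameter ratio (both simplifications)."""
    # With D = r*N and 6*N*D = C, solve 6*r*N**2 = C for N.
    N = math.sqrt(C / (6.0 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

# Gopher's training budget: 280B parameters * 300B tokens.
C_gopher = 6 * 280e9 * 300e9          # ~5.0e23 FLOPs
N_opt, D_opt = compute_optimal(C_gopher)
print(f"N_opt ≈ {N_opt/1e9:.0f}B params, D_opt ≈ {D_opt/1e12:.2f}T tokens")
# → N_opt ≈ 65B params, D_opt ≈ 1.30T tokens
```

The crude 20:1 ratio lands near the paper's ~67B / ~1.5T estimate; the small gap comes from using a fixed ratio instead of the fitted power laws.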
Key Novelty
Equal Scaling of Parameters and Data (Chinchilla Scaling Laws)
  • Conducts three distinct analyses (IsoFLOP profiles, parametric loss modelling, fixed-size varying-tokens) on over 400 models to re-estimate the optimal trade-off.
  • Demonstrates that for every doubling of model size, the number of training tokens should also double (1:1 scaling), rather than the roughly 3:1 imbalance implied by Kaplan et al.'s fits (compute exponents of ≈0.73 for parameters vs ≈0.27 for data).
  • Validates this by training Chinchilla (70B), which matches Gopher's compute budget but uses 4x more data and 4x fewer parameters.
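The parametric loss analysis can be sketched directly. The constants below are the paper's approximate published fits (treat them as illustrative), and the closed-form exponents show where the near-equal scaling comes from:

```python
# Parametric loss fit: L(N, D) = E + A/N**alpha + B/D**beta,
# with constants close to the paper's published estimates (approximate).
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D):
    """Predicted pre-training loss for N parameters and D tokens."""
    return E + A / N**alpha + B / D**beta

# Minimizing loss subject to 6*N*D = C gives N_opt ∝ C**a, D_opt ∝ C**b:
a = beta / (alpha + beta)   # ≈ 0.45, exponent for model size
b = alpha / (alpha + beta)  # ≈ 0.55, exponent for data
# a ≈ b ≈ 0.5 is the "scale parameters and tokens equally" conclusion.

# Same compute budget, two allocations: the fit predicts the balanced
# Chinchilla-style split beats the parameter-heavy Gopher-style one.
print(round(loss(280e9, 300e9), 2))   # Gopher-like:     1.99
print(round(loss(70e9, 1.4e12), 2))   # Chinchilla-like: 1.94
```

The lower predicted loss for the 70B / 1.4T configuration at the same budget is exactly the gap that Chinchilla's training run later confirmed empirically.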
Evaluation Highlights
  • Chinchilla (70B) outperforms Gopher (280B), GPT-3 (175B), and MT-NLG (530B) on the MMLU benchmark, reaching a then-state-of-the-art average accuracy of 67.5% (+7.6% over Gopher).
  • On the BIG-bench benchmark, Chinchilla outperforms Gopher on 58 out of 62 tasks, improving average accuracy by 10.7%.
  • Chinchilla achieves new SOTA on Natural Questions closed-book QA (35.5% 64-shot accuracy) compared to Gopher (28.2%), despite having 4x fewer parameters.
Breakthrough Assessment
10/10
Fundamentally reshaped the field's understanding of scaling laws, proving that data volume is as critical as model size. Directly influenced the design of nearly all subsequent major LLMs (Llama, PaLM 2, etc.).