
Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre
DeepMind
arXiv (2022)
Pretraining Benchmark

📝 Paper Summary

Scaling Laws · LLM Pre-training · Compute Optimization
For compute-optimal training, model size and the number of training tokens should be scaled equally, contradicting prior laws that favored scaling model size much faster than data.
Core Problem
Prior scaling laws (Kaplan et al., 2020) suggested that as compute budgets increase, model size should scale much faster than training data, leading to the creation of massive but undertrained models.
Why it matters:
  • Current LLMs (like Gopher, GPT-3, MT-NLG) are significantly larger than necessary for their compute budget, wasting resources during training and inference.
  • Inference costs scale with model size; oversized models make downstream deployment and fine-tuning prohibitively expensive and slow.
  • Accurately estimating hyperparameters is critical because training large models is extremely capital-intensive and typically done only once.
Concrete Example: Gopher (280B parameters) was trained on 300B tokens. The paper finds that for the same compute budget, a 67B parameter model trained on 1.5T tokens would achieve lower loss and better downstream performance.
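The arithmetic behind this example can be sketched with the widely used C ≈ 6ND FLOP estimate and the roughly 20-tokens-per-parameter rule of thumb implied by the paper's fits. The helper below (`compute_optimal` and the fixed token ratio are illustrative assumptions, not the paper's full fitted laws) recovers a model size close to the 67B figure quoted above:

```python
import math

def compute_optimal(C, tokens_per_param=20.0):
    """Split a FLOP budget C compute-optimally, assuming C ~= 6*N*D
    and a fixed ~20 tokens-per-parameter ratio (both simplifications)."""
    # With D = r*N and 6*N*D = C, solve 6*r*N**2 = C for N.
    N = math.sqrt(C / (6.0 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

# Gopher's training budget: 280B parameters * 300B tokens.
C_gopher = 6 * 280e9 * 300e9          # ~5.0e23 FLOPs
N_opt, D_opt = compute_optimal(C_gopher)
print(f"N_opt ≈ {N_opt/1e9:.0f}B params, D_opt ≈ {D_opt/1e12:.2f}T tokens")
# → N_opt ≈ 65B params, D_opt ≈ 1.30T tokens
```

The crude 20:1 ratio lands near the paper's ~67B / ~1.5T estimate; the small gap comes from using a fixed ratio instead of the fitted power laws.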
Key Novelty
Equal Scaling of Parameters and Data (Chinchilla Scaling Laws)
  • Conducts three distinct analyses (IsoFLOP profiles, parametric loss modelling, fixed-size varying-tokens) on over 400 models to re-estimate the optimal trade-off.
  • Demonstrates that for every doubling of model size, the number of training tokens should also double (1:1 scaling), rather than the roughly 3:1 imbalance implied by Kaplan et al.'s fits (compute exponents of ≈0.73 for parameters vs ≈0.27 for data).
  • Validates this by training Chinchilla (70B), which matches Gopher's compute budget but uses 4x more data and 4x fewer parameters.
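The parametric loss analysis can be sketched directly. The constants below are the paper's approximate published fits (treat them as illustrative), and the closed-form exponents show where the near-equal scaling comes from:

```python
# Parametric loss fit: L(N, D) = E + A/N**alpha + B/D**beta,
# with constants close to the paper's published estimates (approximate).
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D):
    """Predicted pre-training loss for N parameters and D tokens."""
    return E + A / N**alpha + B / D**beta

# Minimizing loss subject to 6*N*D = C gives N_opt ∝ C**a, D_opt ∝ C**b:
a = beta / (alpha + beta)   # ≈ 0.45, exponent for model size
b = alpha / (alpha + beta)  # ≈ 0.55, exponent for data
# a ≈ b ≈ 0.5 is the "scale parameters and tokens equally" conclusion.

# Same compute budget, two allocations: the fit predicts the balanced
# Chinchilla-style split beats the parameter-heavy Gopher-style one.
print(round(loss(280e9, 300e9), 2))   # Gopher-like:     1.99
print(round(loss(70e9, 1.4e12), 2))   # Chinchilla-like: 1.94
```

The lower predicted loss for the 70B / 1.4T configuration at the same budget is exactly the gap that Chinchilla's training run later confirmed empirically.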
Evaluation Highlights
  • Chinchilla (70B) outperforms Gopher (280B), GPT-3 (175B), and MT-NLG (530B) on the MMLU benchmark, reaching a then-state-of-the-art average accuracy of 67.5% (+7.6% over Gopher).
  • On the BIG-bench benchmark, Chinchilla outperforms Gopher on 58 out of 62 tasks, improving average accuracy by 10.7%.
  • Chinchilla achieves new SOTA on Natural Questions closed-book QA (35.5% 64-shot accuracy) compared to Gopher (28.2%), despite having 4x fewer parameters.
Breakthrough Assessment
10/10
Fundamentally reshaped the field's understanding of scaling laws, proving that data volume is as critical as model size. Directly influenced the design of nearly all subsequent major LLMs (Llama, PaLM 2, etc.).