LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
Meta AI
arXiv (2023)
Pretraining · Benchmark · Reasoning · QA

📝 Paper Summary

Foundation Models · Large Language Models (LLMs) · Open-Source Models
LLaMA demonstrates that training smaller models on significantly more tokens than Chinchilla-optimal scaling suggests yields state-of-the-art performance at a fraction of the inference cost, using only publicly available data.
Core Problem
Previous scaling laws (like Chinchilla) optimize for training compute but ignore inference budgets, leading to massive models that are expensive to serve and often trained on proprietary, inaccessible datasets.
Why it matters:
  • Serving large language models at scale is computationally prohibitive; a smaller model trained longer is cheaper at inference time
  • Reliance on undocumented or proprietary data hinders open research, reproducibility, and the study of bias/toxicity
  • Access to competitive LLMs has been limited to large industrial labs with massive compute resources
Concrete Example: While Chinchilla's scaling laws recommend training a 10B model on roughly 200B tokens, LLaMA shows that a 7B model's performance continues to improve well past 1T tokens; the resulting small models rival or beat GPT-3 (175B) on many benchmarks while fitting on a single GPU.
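The trade-off can be sketched with back-of-envelope arithmetic. The figures below assume Chinchilla's rough ~20 tokens-per-parameter heuristic and the standard ~2·N FLOPs-per-token estimate for a forward pass; neither number is taken from the LLaMA paper itself.

```python
def chinchilla_optimal_tokens(params: float) -> float:
    """Compute-optimal token budget under the ~20 tokens/parameter heuristic."""
    return 20 * params

def inference_flops_per_token(params: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per parameter per generated token."""
    return 2 * params

# A 10B model is "Chinchilla-optimal" at ~200B training tokens...
assert chinchilla_optimal_tokens(10e9) == 200e9

# ...while LLaMA-7B is trained on ~1T tokens, roughly 7x past that point.
overtraining_ratio = 1e12 / chinchilla_optimal_tokens(7e9)
print(f"LLaMA-7B training tokens vs Chinchilla-optimal: {overtraining_ratio:.1f}x")

# The serving payoff: a 13B model costs ~13x fewer FLOPs per generated
# token than a 175B model, no matter how long either was trained.
speedup = inference_flops_per_token(175e9) / inference_flops_per_token(13e9)
print(f"13B vs 175B inference cost ratio: {speedup:.1f}x")
```

The extra training compute is paid once; the inference savings are paid back on every token served.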
Key Novelty
Over-training smaller models on massive public datasets
  • Train models ranging from 7B to 65B parameters on trillions of tokens (far beyond the Chinchilla optimal point) to maximize inference efficiency rather than training efficiency
  • Construct a pre-training corpus entirely from publicly available sources (CommonCrawl, C4, GitHub, Wikipedia, ArXiv, etc.) compatible with open-sourcing
  • Integrate architectural improvements from disparate top models (SwiGLU activations from PaLM, rotary positional embeddings from GPTNeo, RMSNorm pre-normalization as in GPT-3) into a single stable architecture
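The three architectural ingredients above can be sketched in a few lines. These are minimal NumPy illustrations of the general techniques, not the paper's implementation; shapes and hyperparameters are toy values chosen for the example.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale by the root-mean-square of the features (no mean
    subtraction); LLaMA applies it to the *input* of each sub-layer
    (pre-normalization) for training stability."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def swiglu(x, W, V, W2):
    """SwiGLU feed-forward block: a SiLU-gated linear unit in place of
    the usual ReLU MLP."""
    silu = lambda z: z / (1 + np.exp(-z))  # SiLU (swish) activation
    return (silu(x @ W) * (x @ V)) @ W2

def rotary_embed(x, positions, base=10000.0):
    """Rotary positional embeddings: rotate each consecutive feature pair
    by a position-dependent angle, so attention scores depend on relative
    positions rather than added absolute position vectors."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)     # (d/2,) per-pair frequencies
    angles = positions[:, None] * freqs[None, :]  # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy usage: (seq_len=4, dim=8) activations through each component.
x = np.random.randn(4, 8)
h = rms_norm(x, np.ones(8))                  # unit-RMS features
q = rotary_embed(h, np.arange(4))            # position-rotated queries
y = swiglu(h, np.random.randn(8, 16),
           np.random.randn(8, 16), np.random.randn(16, 8))
```

Note that the rotation in `rotary_embed` preserves vector norms, which is why it can be applied to queries and keys without destabilizing attention.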
Evaluation Highlights
  • LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being 10x smaller, enabling single-GPU deployment
  • LLaMA-65B is competitive with the best available models, Chinchilla-70B and PaLM-540B, across common sense reasoning and reading comprehension tasks
  • LLaMA-65B achieves state-of-the-art zero-shot/few-shot performance on NaturalQuestions and TriviaQA, beating GPT-3 and Chinchilla
Breakthrough Assessment
10/10
Marked a pivotal shift in the field by proving open-data, smaller 'over-trained' models could beat proprietary giants, effectively democratizing LLM research and spawning the open-source LLM ecosystem.