
The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, et al.
AI @ Meta
arXiv (2024)
Pretraining RL Reasoning MM Speech Benchmark Agent

📝 Paper Summary

Foundation Models Large Language Models (LLMs) Multimodal Models
Llama 3 is a family of dense Transformer models (up to 405B parameters) trained on 15.6T tokens that achieves state-of-the-art performance via massive data scaling and a simplified post-training recipe.
Core Problem
Developing open foundation models that rival the best closed-source models (like GPT-4) in reasoning, coding, and multilinguality requires overcoming challenges in data quality, training stability at scale, and complex post-training alignment.
Why it matters:
  • Closed models currently dominate high-end capabilities, limiting research transparency and community innovation
  • Prior open models often lagged in complex reasoning, coding, and multilingual tasks compared to proprietary counterparts
  • Scalable training of 400B+ parameter models is notoriously unstable and operationally difficult
Concrete Example: When users ask complex reasoning or coding questions, smaller or less-optimized open models often hallucinate or fail to follow multi-step instructions, whereas Llama 3 405B solves these with accuracy comparable to GPT-4.
Key Novelty
Llama 3 Herd of Models
  • Massive scaling of dense Transformers: Training a 405B parameter model on 15.6 trillion tokens, far exceeding standard compute-optimal scaling laws
  • Simplified but rigorous post-training: Utilizing Supervised Fine-Tuning (SFT), rejection sampling, and Direct Preference Optimization (DPO) in place of more complex PPO-based Reinforcement Learning from Human Feedback (RLHF)
  • Compositional multimodal approach: Integrating separate pre-trained encoders for image and speech via adapters rather than training a monolithic multimodal model from scratch
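The DPO objective mentioned above replaces an explicit reward model with a contrastive loss over preference pairs. A minimal per-pair sketch, assuming summed response log-likelihoods under the policy and a frozen reference model are already available (the function name and beta=0.1 default are illustrative, not from the paper):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) preference pair.

    Each argument is the summed token log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log sigmoid(beta * margin): minimized by widening the margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy still matches the reference the margin is zero and the loss is log 2 ≈ 0.693; training pushes the loss below that by making the policy prefer chosen responses more strongly than the reference does.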
Evaluation Highlights
  • Llama 3 405B achieves 88.6% on MMLU (5-shot), comparable to GPT-4's 88.7%
  • Llama 3 405B scores 96.8% on GSM8K (math reasoning), outperforming GPT-4 (94.2%) and GPT-4o (96.1%)
  • Llama 3 405B achieves 89.0% on HumanEval (coding), rivaling GPT-4 (86.6%) and GPT-4o (90.2%)
Breakthrough Assessment
10/10
Represents a massive leap for open-weights models, matching GPT-4-class performance for the first time in a widely released model. The 405B scale and 15.6T-token data volume set a new standard for open AI.