SFT: Supervised Fine-Tuning—training a model on high-quality instruction-response pairs to teach it to follow instructions
DPO: Direct Preference Optimization—an alignment method that optimizes the model to prefer chosen responses over rejected ones without training a separate reward model
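The DPO loss for one preference pair can be sketched directly from the definition. The function name `dpo_loss` and the assumption that summed per-sequence log-probabilities are already available are mine, not from the source:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair, given the summed log-probs
    of the chosen/rejected responses under the policy (pi_*) and the
    frozen reference model (ref_*). beta scales the implicit reward."""
    # Implicit reward margin: how much more the policy prefers the
    # chosen response than the rejected one, relative to the reference.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin; minimized when the margin is large.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that no reward model appears anywhere: the reference model's log-probabilities play that role implicitly.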
RoPE: Rotary Positional Embeddings—a method that encodes position in Transformers by rotating query and key vectors, so attention scores depend on relative positions; it also extrapolates better to longer sequences than learned absolute embeddings
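A minimal numpy sketch of the rotation (pairing consecutive dimensions; real implementations vectorize this over heads and positions):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary embeddings to a vector x of even dimension d at
    sequence position pos. Each consecutive pair of dimensions is
    rotated by an angle depending on pos and the pair's frequency."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # per-pair rotation frequencies
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # (even, odd) components of each pair
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # standard 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The key property: the dot product of a rotated query at position m with a rotated key at position n depends only on m - n, which is what makes the encoding relative.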
GQA: Grouped Query Attention—an attention mechanism that shares key-value heads across multiple query heads to reduce memory bandwidth and improve inference speed
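The sharing can be shown at the shape level. This sketch (function name `gqa_scores` is mine) repeats each KV head so it lines up with its group of query heads; in a real kernel the KV tensors are simply read once per group rather than materialized:

```python
import numpy as np

def gqa_scores(q, k, n_kv_heads):
    """Attention scores with grouped queries. q: (n_q_heads, seq, d);
    k: (n_kv_heads, seq, d). Each group of n_q_heads // n_kv_heads
    query heads attends against the same key head."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    # Repeat each KV head to align it with its group of query heads.
    k_rep = np.repeat(k, group, axis=0)                 # (n_q_heads, seq, d)
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)    # (n_q_heads, seq, seq)
```

With n_kv_heads = n_q_heads this reduces to standard multi-head attention; with n_kv_heads = 1 it is multi-query attention. The KV cache shrinks by the group factor, which is where the inference speedup comes from.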
SwiGLU: A specific activation function (Swish-Gated Linear Unit) used in the feed-forward networks of the Transformer
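A sketch of the SwiGLU feed-forward block (weight names `W_gate`, `W_up`, `W_down` are illustrative):

```python
import numpy as np

def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: the SiLU-activated gate projection is
    multiplied elementwise with a linear "up" projection, then
    projected back down."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```

Compared with a plain ReLU FFN, this uses three weight matrices instead of two, so the hidden dimension is typically shrunk to keep parameter count comparable.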
Tiktoken: A fast BPE tokenizer library used by OpenAI and adopted here with modifications
Rejection Sampling: A technique where the model generates multiple outputs, the best is selected (by a reward model or heuristic), and the model is fine-tuned on that selected output
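The selection step is best-of-n. A toy sketch, with `generate` and `score` as placeholders for the sampling model and the reward model or heuristic:

```python
import random

def rejection_sample(generate, score, prompt, n=8, seed=0):
    """Best-of-n selection: draw n candidate responses for the prompt
    and keep the one the scorer rates highest. The winning
    (prompt, response) pair is what gets added to the fine-tuning set."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)
```

In practice the expensive part is generating the n candidates; scoring and selection are cheap by comparison.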
Scaling Laws: Empirical relationships between model size, dataset size, and compute budget that predict model performance
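One widely used parametric form is the Chinchilla-style fit, where loss decomposes into an irreducible term plus power-law terms in parameters N and tokens D. The coefficients below are the values published by Hoffmann et al. (2022), shown for illustration only:

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style scaling law: predicted training loss as a
    function of parameter count N and training tokens D.
    E is the irreducible loss; the other terms shrink with scale."""
    return E + A / N**alpha + B / D**beta
```

Fits like this are what let labs choose model and dataset sizes before committing a compute budget.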
Annealing: A training phase at the very end where the learning rate is decayed to 0 and data quality is upsampled to boost final performance
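The learning-rate side of annealing can be sketched as a schedule that holds the base rate and then decays linearly to 0 over the final stretch. The decay shape and the 10% fraction here are illustrative assumptions, not a prescription:

```python
def annealing_lr(step, total_steps, base_lr, anneal_frac=0.1):
    """Toy schedule: hold base_lr for most of training, then in the
    final anneal_frac of steps decay the learning rate linearly to 0.
    (The fraction and decay shape vary by recipe.)"""
    anneal_start = int(total_steps * (1 - anneal_frac))
    if step < anneal_start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - anneal_start)
```

The data-quality upsampling happens over this same final window, so the model's last gradient updates come from the cleanest data at the smallest step sizes.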
IsoFLOPs: Curves showing the trade-off between model size and training tokens for a fixed compute budget
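The trade-off follows from the common approximation C ≈ 6·N·D FLOPs for training a model with N parameters on D tokens; fixing C pins down D for each N, and an IsoFLOP curve sweeps N at fixed C and plots the resulting loss:

```python
def tokens_for_budget(C, N):
    """Given a fixed compute budget C (FLOPs) and model size N
    (parameters), return the token count D the budget affords,
    using the C = 6*N*D approximation."""
    return C / (6 * N)
```

For example, a 6e23 FLOP budget spent on a 10B-parameter model affords roughly 1e13 (10T) tokens; doubling the model halves the tokens.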
4D Parallelism: Combining Tensor, Pipeline, Context, and Data Parallelism to distribute training across thousands of GPUs
FSDP: Fully Sharded Data Parallelism—a technique that shards model parameters, gradients, and optimizer states across data parallel workers to save memory