Llama 2: Open Foundation and Fine-Tuned Chat Models

📝 Paper Summary

Large Language Models (LLMs) Open-source LLMs RLHF (Reinforcement Learning with Human Feedback)

Llama 2 is a family of open-access foundation and chat-optimized models (7B to 70B parameters) that use iterative RLHF and novel attention mechanisms to match closed-source model performance.

Core Problem

Existing open-source LLMs lag behind closed 'product' LLMs (like ChatGPT) in usability and safety because they lack the extensive, expensive fine-tuning required for human alignment.

Why it matters:

Closed-source models are opaque, limiting community research into AI alignment and safety.
High computational and annotation costs prevent most researchers from developing aligned models from scratch.
Publicly released pretrained models (like LLaMa-1) are not suitable substitutes for product-level chat assistants without heavy fine-tuning.

Concrete Example: When given a prompt to 'Write a poem to help me remember the first 10 elements', a standard SFT model might hallucinate or produce unstructured text. Llama 2-Chat uses iterative RLHF to ensure the poem is factual, structured, and helpful.

Key Novelty

Iterative RLHF with Ghost Attention (GAtt)

Uses two separate reward models (one for helpfulness, one for safety) to resolve the tension between refusing unsafe requests and answering helpful ones.
Introduces Ghost Attention (GAtt), a method to help the model maintain instructions (like 'act as a pirate') over multiple turns of dialogue without forgetting.
Employs an iterative fine-tuning process where Rejection Sampling and PPO are applied sequentially, with the model distribution updated weekly based on new human preference data.

Architecture

The complete training pipeline for Llama 2-Chat, from pretraining to iterative RLHF.

Evaluation Highlights

Llama 2 70B scores 68.9% on MMLU (5-shot), outperforming Llama 1 65B (63.4%) and approaching GPT-3.5 (70.0%).
In human evaluations for helpfulness, Llama 2-Chat 70B outperforms ChatGPT (win rate not explicitly quantified as a single number but shown dominating in Figure 1).
Llama 2 70B scores 56.8% on GSM8K (8-shot), significantly improving over Llama 1 65B (50.9% inferred from context/graphs) but trailing GPT-4 (92.0%).

Breakthrough Assessment

9/10

A definitive open-weights release that established a new baseline for open-source models, narrowing the gap with closed proprietary models like GPT-3.5 and enabling widespread alignment research.

⚙️ Technical Details

Problem Definition

Setting: Pretraining on self-supervised data followed by alignment to human preferences via supervised fine-tuning and reinforcement learning.

Inputs: Text prompts (single or multi-turn dialogue)

Outputs: Text completions/responses

Pipeline Flow

Pretraining (Self-supervised learning on 2T tokens)
Supervised Fine-Tuning (SFT on ~27k high-quality examples)
Reward Modeling (Training separate Safety and Helpfulness models)
RLHF (Iterative Rejection Sampling and PPO)

System Modules

Llama 2 Base Model

Next-token prediction on massive text corpus

Model or implementation: Auto-regressive transformer (7B, 13B, 70B)

Reward Models (Alignment)

Score responses for Helpfulness and Safety

Model or implementation: Initialized from chat checkpoints, replaced head with regression head

RLHF Fine-Tuner (Alignment)

Optimize policy toward high reward scores

Model or implementation: Llama 2-Chat

Novel Architectural Elements

Ghost Attention (GAtt): A data augmentation strategy in fine-tuning where instructions are concatenated to all user turns to enforce constraint consistency across dialogue.
Grouped-Query Attention (GQA): Used in 34B and 70B models to improve inference scalability (architectural change from Llama 1).

Modeling

Base Model: Llama 2 (7B, 13B, 70B)

Training Method: Iterative RLHF using Rejection Sampling and Proximal Policy Optimization (PPO)

Objective Functions:

Purpose: SFT standard loss.

Formally: Autoregressive objective (cross-entropy) on answer tokens only.
Purpose: Reward Model Ranking Loss.

Formally: L_ranking = -log(σ(r(x, y_c) - r(x, y_r) - m(r))), employing a margin m(r) based on preference certainty.
Purpose: PPO Objective.

Formally: R(g|p) = ~Rc(g|p) - β * D_KL(π_θ(g|p) || π_0(g|p)), penalizing divergence from the initial policy.

Adaptation: Full fine-tuning

Trainable Parameters: Full weights updated

Training Data:

Pretraining: 2 Trillion tokens
SFT: 27,540 high-quality annotations
Reward Modeling: >1 million binary comparisons (Meta Safety + Helpfulness data)

Key Hyperparameters:

learning_rate: Pretraining: 1.5e-4 to 3.0e-4 (depending on size); SFT: 2e-5; RLHF: 5e-6 to 1e-5
batch_size: Pretraining: 4M tokens global batch size; SFT: 64
context_length: 4096 tokens
+ 4 more
weight_decay: 0.1
optimizer: AdamW (β1=0.9, β2=0.95)
ppo_clip_threshold: 0.2
kl_penalty_beta: 0.01 (7B/13B) or 0.005 (34B/70B)

Compute: 3.3M GPU hours on NVIDIA A100-80GB (Total for all models)

Comparison to Prior Work

vs. Llama 1: Doubled context length (4k), +40% pretraining data (2T tokens), GQA for larger models.
vs. MPT/Falcon: Llama 2 70B outperforms widely on coding and reasoning benchmarks.
vs. ChatGPT: Llama 2-Chat is open-weights and fine-tuned specifically for safety using a dual reward model approach (Safety RM + Helpfulness RM).

Limitations

Still trails GPT-4 significantly in coding and advanced reasoning benchmarks.
Knowledge cutoff limits ability to answer queries about recent events.
Potential for hallucinations and safety failures despite extensive tuning.
English-centric training data limits multilingual performance.

Reproducibility

Code: https://github.com/facebookresearch/llama

Model weights for 7B, 13B, and 70B are publicly released. 34B is not released. Code for inference and examples provided. Training code and dataset specifics (exact URLs/indices) are not fully open-sourced. SFT and RLHF data are internal/proprietary (Meta Safety/Helpfulness data).

📊 Experiments & Results

Evaluation Setup

Comprehensive benchmarking on standard academic datasets and human evaluation for chat capabilities.

Benchmarks:

MMLU (General knowledge and reasoning (5-shot))
GSM8K (Math word problems (8-shot))
HumanEval (Python coding (0-shot))
Meta Helpfulness/Safety Human Eval (Pairwise human preference vs. baselines) [New]

Metrics:

Accuracy
Pass@1
Win-rate against baselines (Human & GPT-4 Judge)
Statistical methodology: 95% confidence intervals reported for human evaluations (between 1% and 2%).

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Academic benchmarks comparing Llama 2 70B against closed-source models.
MMLU (5-shot)	Accuracy	70.0	68.9	-1.1
GSM8K (8-shot)	Accuracy	57.1	56.8	-0.3
HumanEval (0-shot)	Pass@1	48.1	29.9	-18.2
Comparison against open-source baselines showing significant improvements over the previous generation.
MMLU (5-shot)	Accuracy	63.4	68.9	+5.5
Big Bench Hard (3-shot)	Accuracy	43.5	51.2	+7.7

Main Takeaways

Quality Is All You Need: A small set of high-quality SFT data (~27k examples) is sufficient and superior to millions of lower-quality examples.
RLHF enables models to surpass the supervision provided by SFT alone, likely because judging outputs is easier than generating them.
Ghost Attention (GAtt) effectively solves the issue of context loss in multi-turn instructions, allowing the model to remember constraints like 'act as a pirate' throughout a conversation.
Separating Helpfulness and Safety reward models allows for better tuning of the tradeoff, preventing the model from becoming overly evasive (false refusals) or unsafe.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (specifically decoder-only)
Reinforcement Learning with Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Attention mechanisms (Multi-Head vs. Grouped-Query)

Key Terms

RLHF: Reinforcement Learning with Human Feedback—a method to align LLMs with human intent using preference data.

PPO: Proximal Policy Optimization—an RL algorithm used to update the model policy while preventing it from deviating too wildly from the previous version.

GQA: Grouped-Query Attention—an optimization where multiple query heads share a single key-value head to reduce memory bandwidth usage during inference.

Ghost Attention (GAtt): A fine-tuning technique where instructions are artificially concatenated to all user messages during training to improve multi-turn instruction following.

Rejection Sampling: A fine-tuning method where the model generates multiple outputs, the best is selected by a reward model, and the model is retrained on that 'gold' sample.

SFT: Supervised Fine-Tuning—training the model on high-quality instruction-response pairs.

RoPE: Rotary Positional Embeddings—a method for encoding position information in Transformers.

SwiGLU: A specific activation function used in the feed-forward layers of the Transformer.

RMSNorm: Root Mean Square Normalization—a normalization technique applied to the inputs of transformer layers.

KV cache: Key-Value cache—storing attention keys and values to speed up autoregressive generation.