SwiGLU: A gated activation function combining Swish and GLU (Gated Linear Unit) that generally improves transformer performance compared to ReLU
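A minimal sketch of the idea, assuming NumPy and toy projection matrices `W` (gate) and `V` (value), which are illustrative names rather than anything from a particular codebase:

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish / SiLU: x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def swiglu(x, W, V):
    # Gated linear unit with a Swish gate: Swish(xW) elementwise-times xV.
    return swish(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # (batch, d_model)
W = rng.standard_normal((8, 16))  # gate projection (hypothetical sizes)
V = rng.standard_normal((8, 16))  # value projection
out = swiglu(x, W, V)
print(out.shape)  # (2, 16)
```

In a transformer FFN this replaces `ReLU(xW)`; the multiplicative gate lets the layer modulate each hidden unit smoothly instead of hard-thresholding it.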
Rotary Embeddings (RoPE): A positional encoding method that rotates query and key vectors in vector space by an angle proportional to each token's position, so attention scores depend on relative position and generalize better to variable sequence lengths
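A sketch of the rotation, assuming NumPy, the common base of 10000, and the pairing of consecutive dimensions; details vary across implementations:

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, d) with even d. Each consecutive pair of dimensions is
    # rotated by an angle that grows with position and shrinks with the
    # pair's index, so nearby positions get similar rotations.
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)        # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(1).standard_normal((6, 8))
q_rot = rope(q)
```

Because each step is a pure rotation, vector norms are preserved, and the dot product between a rotated query and key depends only on their relative offset.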
RMSNorm: Root Mean Square Layer Normalization—a simplified normalization technique that stabilizes training by normalizing inputs based on their root mean square, ignoring mean centering
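A minimal NumPy sketch; `weight` is the learnable per-dimension scale, and `eps` is a small constant for numerical stability:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize by the root mean square over the last axis; no mean
    # subtraction and no bias term, unlike LayerNorm.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[1.0, 2.0, 3.0]])
y = rms_norm(x, np.ones(3))
```

Dropping the mean-centering step saves a reduction per layer while keeping the key benefit, inputs rescaled to unit RMS.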
Chinchilla scaling laws: Empirical laws suggesting that for a fixed compute budget, parameter count and training tokens should be scaled in roughly equal proportion; LLaMA deliberately trains smaller models on far more tokens than this prescribes, trading extra training compute for cheaper inference
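As a back-of-the-envelope illustration, assuming the common rule-of-thumb reading of Chinchilla of roughly 20 training tokens per parameter (the exact ratio depends on the fitted constants):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    # Rule-of-thumb reading of Chinchilla: ~20 tokens per parameter at
    # compute-optimality. tokens_per_param is an assumed heuristic value.
    return n_params * tokens_per_param

# A 7B-parameter model would be compute-optimal near 140B tokens; LLaMA-7B
# instead trained on about 1T tokens, "overtraining" by this rule to get a
# smaller model that is cheaper to serve.
print(chinchilla_optimal_tokens(7e9) / 1e9)  # 140.0
```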
BPE: Byte-Pair Encoding—a subword tokenization algorithm that starts from individual characters or bytes and iteratively merges the most frequent pair of adjacent symbols into a new vocabulary entry
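A toy training loop for the merge procedure, assuming a word-frequency corpus as input (real tokenizers add pre-tokenization, byte fallback, and faster data structures):

```python
from collections import Counter

def bpe_train(words, num_merges):
    # words: dict mapping a word (tuple of symbols) -> frequency.
    # Repeatedly merge the most frequent adjacent symbol pair.
    merges = []
    words = dict(words)
    for _ in range(num_merges):
        pairs = Counter()
        for w, freq in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for w, freq in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges, words

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w"): 3}
merges, vocab = bpe_train(corpus, 2)
```

Each merge adds one symbol to the vocabulary, so frequent substrings become single tokens while rare words still decompose into smaller pieces.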
FlashAttention: An IO-aware exact attention algorithm that reduces memory usage and speeds up training by minimizing reads/writes between GPU HBM and on-chip SRAM
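The core trick that makes tiling possible is the online softmax: attention for a query can be accumulated block by block with a running max and normalizer, so the full attention row never has to be materialized. A NumPy sketch of that numerics (FlashAttention itself is a fused GPU kernel; this only shows the math):

```python
import numpy as np

def blockwise_attention(q, K, V, block=4):
    # One query vector q against K/V processed in blocks, keeping a
    # running max m and normalizer s (the online-softmax recurrence).
    d = q.shape[-1]
    m = -np.inf           # running max of scores seen so far
    s = 0.0               # running softmax normalizer
    acc = np.zeros(d)     # running weighted sum of values
    for i in range(0, K.shape[0], block):
        scores = (K[i:i + block] @ q) / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)      # rescale old accumulators
        p = np.exp(scores - m_new)
        s = s * scale + p.sum()
        acc = acc * scale + p @ V[i:i + block]
        m = m_new
    return acc / s
```

The result is exact (not an approximation); the IO savings come from only ever touching one block of K/V at a time.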
Massive Multitask Language Understanding (MMLU): A benchmark covering 57 subjects (STEM, humanities, etc.) designed to test world knowledge and problem-solving
Chain-of-thought: A prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer
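A minimal illustration of the zero-shot variant, where a fixed trailing instruction nudges the model into emitting its reasoning before the answer (the question text here is made up for the example):

```python
# Hypothetical question; the appended cue is the standard zero-shot
# chain-of-thought trigger phrase.
question = ("A pack has 12 pens. If 3 packs are bought and 7 pens "
            "are lost, how many pens remain?")
prompt = f"Q: {question}\nA: Let's think step by step."
print(prompt)
```

Few-shot chain-of-thought works the same way but prepends worked examples whose answers already contain intermediate reasoning steps.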