GPT-3: Language Models are Few-Shot Learners

📝 Paper Summary

Large Language Models (LLMs), Few-Shot Learning, In-Context Learning

Scaling autoregressive language models to 175 billion parameters enables them to perform a wide variety of NLP tasks via in-context learning without any gradient updates or fine-tuning.

Core Problem

Standard NLP paradigms require large task-specific datasets for fine-tuning, limiting applicability and potentially exploiting spurious correlations rather than true generalization.

Why it matters:

Collecting large labeled datasets for every new task is impractical and limits the versatility of language systems
Models fine-tuned on narrow distributions often generalize poorly out-of-distribution
Humans can learn new language tasks from just a few examples or instructions, a capability current NLP systems largely lack

Concrete Example: To perform a task like correcting grammar or critiquing a story, standard approaches require thousands of labeled examples. Humans only need a brief instruction (e.g., 'tell me if this sentence is happy or sad').

Key Novelty

GPT-3 (175B parameter In-Context Learner)

Scales model size to 175 billion parameters (10x previous non-sparse models) to test if meta-learning abilities improve with scale
Uses 'in-context learning' where the model is conditioned on a natural language instruction and/or a few demonstrations within its context window at inference time
Evaluates performance strictly without gradient updates, relying solely on the pre-trained model's pattern recognition abilities

Architecture

Conceptual diagram of Language Model Meta-Learning (In-Context Learning)

Evaluation Highlights

Achieves 86.4% accuracy on LAMBADA (few-shot), improving on the previous state of the art (68.0%) by over 18 percentage points
Generates synthetic news articles that human evaluators struggle to distinguish from human-written ones (detection accuracy of about 52%, close to the 50% chance level)
Matches SOTA open-domain fine-tuning performance on TriviaQA in the few-shot setting (71.2% accuracy)

Breakthrough Assessment

10/10

Defined the modern era of Generative AI. Demonstrated that massive scale leads to emergent in-context learning abilities, allowing models to solve tasks without fine-tuning.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive language modeling (predicting the next token) applied to downstream tasks via text interaction

Inputs: Context sequence containing task description and/or K examples (demonstrations)

Outputs: Completion of the sequence (the answer to the final example)

Pipeline Flow

Input Construction (Task description + K examples)
Forward Pass (Transformer layers processing context)
Prediction (Next token generation)
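
The input-construction step above can be sketched as follows. This is a minimal illustration, not the paper's exact prompt format: the `build_prompt` helper, the `Input:`/`Output:` template, and the translation examples are all hypothetical choices made for this sketch.

```python
def build_prompt(task_description, demonstrations, query, sep="\n\n"):
    """Assemble an in-context prompt: description, K solved examples, then the open query."""
    parts = []
    if task_description:
        parts.append(task_description)
    for source, target in demonstrations:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The final example is left unanswered; the model's completion is the prediction.
    parts.append(f"Input: {query}\nOutput:")
    return sep.join(parts)


# K = 2 demonstrations (few-shot); pass an empty list for zero-shot.
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
prompt = build_prompt("Translate English to French.", demos, "peppermint")
print(prompt)
```

The resulting string is the only thing that changes between tasks; the model weights stay fixed, which is what distinguishes this setup from fine-tuning.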

System Modules

GPT-3 (175B)

Process input context and generate completion

Model or implementation: Transformer (decoder-only) with alternating dense and locally banded sparse attention

Novel Architectural Elements

Alternating dense and locally banded sparse attention patterns in Transformer layers (similar to Sparse Transformer)
Massive scaling to 175 billion parameters (order of magnitude larger than previous 17B Turing-NLG)
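
To make the difference between the two attention patterns concrete, the sketch below builds causal masks for a dense layer and a locally banded layer. The window size, the even/odd layer assignment, and both function names are assumptions for illustration; the summary does not specify them.

```python
def causal_mask(n, window=None):
    """Return an n x n 0/1 mask; row i marks the positions token i may attend to.

    window=None gives dense causal attention (all earlier positions).
    window=w gives a locally banded pattern (only the w most recent positions).
    """
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            visible = j <= i and (window is None or i - j < window)
            row.append(1 if visible else 0)
        mask.append(row)
    return mask


def layer_masks(n_layers, n, window):
    """Alternate dense and banded masks across layers (assumed: even = dense, odd = banded)."""
    return [causal_mask(n, None if layer % 2 == 0 else window) for layer in range(n_layers)]


# Banded pattern for 5 tokens with a hypothetical window of 2.
for row in causal_mask(5, window=2):
    print(row)
```

The banded layers reduce the number of attended positions per token from the full prefix to a fixed window, while the interleaved dense layers still let information flow across the whole context.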

Modeling

Base Model: GPT-3 175B (and 7 smaller variants down to 125M)

Training Method: Standard autoregressive language model pre-training

Objective Functions:

Purpose: Minimize negative log-likelihood of the next token.

Formally: Standard Cross-Entropy Loss.
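
Written out, the objective takes the usual form below. The notation is chosen here for illustration: a token sequence x_1, ..., x_T, model parameters θ, and a per-token average (the averaging convention is an assumption; the summary only names the loss).

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```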

Training Data:

Common Crawl (filtered) - 410 billion tokens
WebText2 - 19 billion tokens
Books1 - 12 billion tokens
Books2 - 55 billion tokens
Wikipedia - 3 billion tokens
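
As a quick piece of arithmetic on the token counts above, the sketch below computes each dataset's share of the raw corpus. These are shares by size only, not the training mix: the paper describes sampling higher-quality datasets more often than their size alone would suggest.

```python
# Token counts in billions, copied from the list above.
tokens_billion = {
    "Common Crawl (filtered)": 410,
    "WebText2": 19,
    "Books1": 12,
    "Books2": 55,
    "Wikipedia": 3,
}

total = sum(tokens_billion.values())
shares = {name: count / total for name, count in tokens_billion.items()}

for name, share in shares.items():
    print(f"{name:<24} {share:6.1%}")
print(f"total: {total}B tokens")
```

The total of 499 billion tokens exceeds the 300 billion training tokens listed under Key Hyperparameters, so the model does not see every token of the raw corpus once.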

Key Hyperparameters:

n_params: 175 Billion
n_layers: 96
d_model: 12288
n_heads: 96
d_head: 128
batch_size: 3.2M tokens
learning_rate: 0.6e-4
context_window: 2048
training_tokens: 300 Billion
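
The listed values can be sanity-checked against the 175 billion parameter figure. The sketch below uses a common rule of thumb for decoder-only Transformers, roughly 12 * n_layers * d_model^2 non-embedding parameters; that formula is an assumption brought in here, not something stated in the summary.

```python
n_layers = 96
d_model = 12288
n_heads = 96
d_head = 128

# Head dimensions should tile the model width exactly.
assert n_heads * d_head == d_model

# Rule of thumb (an assumption, not from the summary): each layer holds about
# 4*d_model^2 attention weights plus 8*d_model^2 feed-forward weights
# (feed-forward width 4*d_model), ignoring biases, layer norms, and embeddings.
approx_params = 12 * n_layers * d_model ** 2
print(f"{approx_params / 1e9:.1f}B non-embedding parameters")
```

The estimate lands at about 174 billion, within about 1% of the stated 175 billion; token and position embeddings account for part of the remainder.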

Compute: 3640 petaflop/s-days (for 175B model pre-training)
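
The compute figure is consistent with the parameter and token counts listed above. The sketch below assumes the common approximation of about 6 floating-point operations per parameter per training token; treat it as a rough check rather than an exact accounting.

```python
n_params = 175e9          # from Key Hyperparameters
training_tokens = 300e9   # from Key Hyperparameters

# Assumption: ~6 floating-point operations per parameter per training token
# (forward plus backward pass), a common approximation.
total_flops = 6 * n_params * training_tokens

# One petaflop/s-day = 1e15 operations per second sustained for 86,400 seconds.
flops_per_pf_day = 1e15 * 86_400
pf_days = total_flops / flops_per_pf_day
print(f"{total_flops:.2e} FLOPs ~ {pf_days:.0f} petaflop/s-days")
```

The result is within 1% of the 3640 petaflop/s-days reported above.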

Comparison to Prior Work

vs. T5/RoBERTa: GPT-3 uses no gradient updates (fine-tuning) for downstream tasks, relying only on inference-time context
vs. Turing-NLG: 10x larger parameter count, enabling significantly stronger few-shot performance

Limitations

Performance lags behind SOTA on some tasks like ANLI, RACE, and QuAC
Generating long passages can still lose coherence or repeat content
Possibility of data contamination (test set overlap) due to training on broad internet data
Expensive inference and training costs due to massive model size
Potential for misuse (disinformation, spam) and bias inheritance from web data

Reproducibility

Code availability not provided. Dataset reconstruction (Common Crawl filtering) described in Appendix. Model weights not released. Test set contamination studies provided.

📊 Experiments & Results

Evaluation Setup

Zero-shot, One-shot, and Few-shot (10-100 examples) evaluation on NLP benchmarks without weight updates

Benchmarks:

LAMBADA (Modeling long-range dependencies / word prediction)
TriviaQA (Closed-book Question Answering)
SuperGLUE (General Language Understanding)
HellaSwag (Commonsense reasoning: ending selection)
Winograd (Pronoun resolution)

Metrics:

Accuracy
Perplexity
F1 Score
BLEU

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Language modeling and completion tasks show SOTA performance in zero/few-shot settings.
Penn Tree Bank (PTB)	Perplexity	35.8	20.5	-15.3
LAMBADA	Accuracy	68.0	86.4	+18.4
Closed-book QA results show GPT-3 exceeding the fine-tuned baseline on TriviaQA but trailing it on WebQuestions, all without external retrieval.
TriviaQA	Accuracy	68.0	71.2	+3.2
WebQuestions	Accuracy	44.7	41.5	-3.2
Translation results show strength in translating into English.
WMT'14 Fr-En	BLEU	35.0	39.2	+4.2
Winograd-style tasks show competitive performance.
Winograd	Accuracy	90.1	89.7	-0.4

Experiment Figures

Impact of model size and number of examples (K) on task performance

Main Takeaways

In-context learning performance improves smoothly with model size (the paper reports a power-law trend for validation loss versus compute), and the gap between zero-, one-, and few-shot performance widens as capacity increases
GPT-3 is highly proficient at 'closed book' QA, effectively storing knowledge in its parameters
Few-shot learning allows GPT-3 to be competitive with fine-tuned SOTA on many tasks without any gradient updates
Limitations persist in tasks requiring complex inference (NLI) or reading comprehension (RACE/QuAC)

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (specifically decoder-only)
Language modeling objectives (next-token prediction)
Zero-shot vs. Few-shot learning concepts

Key Terms

In-context learning: The inner loop of meta-learning where a model adapts to a task at inference time using only the context window, without weight updates

Few-shot: Providing the model with K examples (typically 10-100) in the context window before the target query

One-shot: Providing the model with exactly one example and a natural language task description

Zero-shot: Providing the model with only a natural language instruction and no examples

SOTA: State-of-the-art, the best performance currently achieved by any known method

Cloze task: A task where the model must fill in a missing word or phrase in a sentence (e.g., fill-in-the-blank)

Perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance

BLEU: Bilingual Evaluation Understudy, a metric for evaluating machine-generated text, commonly used in translation

Beam search: A heuristic decoding algorithm that, at each step, keeps only a fixed number (the beam width) of the most promising partial sequences and expands those

Winograd Schema: A challenge requiring the resolution of an ambiguous pronoun in a statement, testing commonsense reasoning

Autoregressive: A model property where the output at the current time step depends only on previous time steps (predicting strictly left-to-right)
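
The perplexity entry above can be made concrete with a short calculation. This sketch assumes per-token natural-log probabilities are already available; the `perplexity` helper and the example values are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical log-probabilities a model assigned to four observed tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(log_probs))
```

A model that spreads probability uniformly over N choices at every step has perplexity N, which is why lower values indicate sharper, more accurate predictions.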