
Language Models are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
OpenAI
OpenAI Technical Report (2019)
Pretraining · QA · Reasoning · Benchmark

📝 Paper Summary

Language Modeling · Zero-shot Learning · Transfer Learning
A language model trained on a sufficiently large and diverse dataset (WebText) learns to perform a variety of NLP tasks, such as translation and question answering, in a zero-shot setting without any explicit supervision.
Core Problem
Current machine learning systems are narrow experts that require large, manually labeled datasets for specific tasks and generalize poorly to new distributions.
Why it matters:
  • Creating labeled datasets for every possible task is expensive and unscalable.
  • Single-task training on single domains leads to brittle systems sensitive to data distribution changes.
  • General-purpose systems should be able to perform tasks without explicit parameter modification or retraining.
Concrete Example: A standard translation model needs a large parallel corpus of English-French sentence pairs to learn. In contrast, this paper shows that a language model can translate simply because patterns like 'english sentence = french sentence' occur naturally in web text, without the model ever being explicitly trained to translate.
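The paper's central reframing — every task is just next-token prediction over one stream of text — can be sketched with a toy model. The character bigram "language model" below is a hypothetical stand-in for GPT-2 (which uses a 1.5B-parameter Transformer); only the interface matters: task, input, and output all live in the prompt, and p(output | input, task) factors by the chain rule into next-token probabilities.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for a language model: a character bigram model with
# add-one smoothing trained on a tiny "web text" corpus that happens
# to contain 'english = french' patterns, echoing the paper's example.
corpus = "hello = bonjour . good = bon . "
VOCAB = sorted(set(corpus))

counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def next_char_logprob(prefix, ch):
    """log p(ch | prefix) under the bigram model (add-one smoothed)."""
    followers = counts[prefix[-1]]
    return math.log((followers[ch] + 1) / (sum(followers.values()) + len(VOCAB)))

def continuation_logprob(prompt, continuation):
    """log p(continuation | prompt), factored by the chain rule:
    sum_t log p(token_t | prompt + tokens_<t)."""
    prefix, lp = prompt, 0.0
    for ch in continuation:
        lp += next_char_logprob(prefix, ch)
        prefix += ch
    return lp

# The "task" is expressed purely as text in the prompt — no task-specific
# head or retraining. Smoothing handles characters never seen in training.
prompt = "hello = "
print(continuation_logprob(prompt, "bonjour") > continuation_logprob(prompt, "zzzzzzz"))  # → True
```

The design point the sketch illustrates: swapping the bigram model for a high-capacity Transformer changes the quality of `next_char_logprob`, but the task-as-prompt interface stays exactly the same.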
Key Novelty
GPT-2 (Unsupervised Multitask Learner)
  • Hypothesizes that a sufficiently high-capacity language model will implicitly learn to infer and perform tasks demonstrated in natural language sequences (e.g., translation, summarization) just to better predict the next token.
  • Frames all NLP tasks as conditional language modeling, p(output | input, task), where the task is specified via natural language prompts (e.g., 'TL;DR:' for summarization) rather than architectural changes.
  • Introduces WebText, a new high-quality dataset of millions of webpages filtered by human curation (Reddit karma), to provide diverse demonstrations of natural language tasks.
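WebText's curation heuristic — keep only pages linked from Reddit posts that received at least 3 karma — can be sketched in a few lines. The field names (`url`, `karma`) and the de-duplication step are illustrative assumptions, not OpenAI's actual pipeline or schema.

```python
# Hypothetical sketch of the WebText filtering heuristic: Reddit karma
# serves as a cheap human-curation signal for page quality.
KARMA_THRESHOLD = 3  # the paper's stated cutoff: at least 3 karma

def filter_outbound_links(posts):
    """Return de-duplicated URLs from posts meeting the karma threshold."""
    kept, seen = [], set()
    for post in posts:
        if post["karma"] >= KARMA_THRESHOLD and post["url"] not in seen:
            seen.add(post["url"])
            kept.append(post["url"])
    return kept

posts = [
    {"url": "https://example.com/a", "karma": 12},
    {"url": "https://example.com/b", "karma": 1},   # below threshold, dropped
    {"url": "https://example.com/a", "karma": 5},   # duplicate URL, dropped
]
print(filter_outbound_links(posts))  # → ['https://example.com/a']
```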
Evaluation Highlights
  • Achieves state-of-the-art results on 7 out of 8 tested language modeling datasets (e.g., Penn Treebank, WikiText-2) in a zero-shot setting; the one exception is the One Billion Word benchmark.
  • Matches or exceeds 3 out of 4 supervised baselines on the CoQA reading comprehension dataset (55 F1) without using any of the 127,000+ training examples.
  • Reaches 70.7% accuracy on the Winograd Schema Challenge, improving state-of-the-art accuracy by 7%.
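The 55 F1 reported on CoQA is a token-overlap score in the SQuAD family of metrics. A minimal sketch of per-answer token F1 (omitting the official script's answer normalization and multi-reference averaging):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer.
    F1 = 2PR / (P + R), where precision and recall are computed over
    the multiset intersection of whitespace tokens."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# 2 of 3 predicted tokens overlap: P = 2/3, R = 1, F1 = 0.8
print(token_f1("in the garden", "the garden"))  # → 0.8
```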
Breakthrough Assessment
10/10
This paper fundamentally shifted the NLP paradigm from supervised, task-specific architectures to large-scale, general-purpose generative pre-training. It demonstrated that scaling up model capacity and training data can induce zero-shot task capabilities without any task-specific supervision.