OPT: Open Pre-trained Transformer—the suite of models released in this paper
FSDP: Fully Sharded Data Parallel—a memory-efficiency technique that shards model parameters, gradients, and optimizer states across GPUs
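A minimal pure-Python sketch of the sharding idea behind FSDP (the helper names `shard` and `all_gather` are hypothetical, not the PyTorch API): each of N workers stores only 1/N of the flat parameter vector, and the full vector is reconstructed (all-gathered) only when it is needed.

```python
def shard(params, n_workers):
    """Split a flat parameter list into n_workers contiguous shards."""
    k = -(-len(params) // n_workers)  # ceiling division
    return [params[i * k:(i + 1) * k] for i in range(n_workers)]

def all_gather(shards):
    """Reassemble the full parameter vector from the per-worker shards."""
    return [p for s in shards for p in s]

params = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
shards = shard(params, 3)            # each worker holds 2 of the 6 values
assert all_gather(shards) == params  # gathering recovers the full vector
```

In real FSDP the same sharding applies to gradients and optimizer state, which is where most of the memory savings come from.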
Tensor Parallelism: Splitting individual tensor operations across multiple GPUs (e.g., Megatron-LM style) to fit large layers in memory
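A sketch of column-parallel tensor parallelism in the Megatron-LM style, for a linear layer y = x @ W (all helpers here are hypothetical illustrations): W's columns are split across devices, each device computes its slice of y, and the slices are concatenated.

```python
def matmul(x, W):
    """Product of row vector x with matrix W, both as plain Python lists."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def split_columns(W, n_devices):
    """Give each device a contiguous block of W's columns."""
    k = len(W[0]) // n_devices
    return [[row[d * k:(d + 1) * k] for row in W] for d in range(n_devices)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
partials = [matmul(x, Wd) for Wd in split_columns(W, 2)]  # one slice per device
y = [v for part in partials for v in part]                # concatenate slices
assert y == matmul(x, W)  # parallel result matches the unsplit computation
```

No cross-device communication is needed until the slices are concatenated, which is why this split fits large layers in memory cheaply.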
Dynamic Loss Scaling: Adjusting the scaling factor for loss values during mixed-precision training to prevent underflow of small gradients
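A hypothetical sketch of the dynamic loss scaling loop (the class and its parameters are illustrative, not a specific library's API): gradients are computed on a scaled loss; on overflow the step is skipped and the scale is halved, and after a run of stable steps the scale is grown again.

```python
import math

class LossScaler:
    def __init__(self, scale=2.0**15, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, scaled_grads):
        """Return unscaled grads, or None if the step must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= 2          # overflow: shrink scale, skip update
            self.good_steps = 0
            return None
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2          # stable for a while: try a larger scale
        return [g / self.scale for g in scaled_grads]

scaler = LossScaler(scale=1024.0)
assert scaler.step([float("inf")]) is None  # overflow step is dropped
assert scaler.scale == 512.0                # and the scale is halved
assert scaler.step([512.0]) == [1.0]        # normal step: grads unscaled
```

Scaling the loss up before backprop keeps small fp16 gradients above the underflow threshold; dividing the gradients back down before the optimizer step keeps the update mathematically unchanged.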
Zero-shot learning: Evaluating a model on a task without providing any examples in the prompt
Few-shot learning: Evaluating a model by providing a few examples (demonstrations) in the prompt context
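The zero-shot/few-shot distinction is purely a matter of prompt construction. A hypothetical helper for a sentiment task (the format and labels are illustrative): zero-shot gives only the query, few-shot prepends labeled demonstrations.

```python
def build_prompt(query, demos=()):
    """Format an optional list of (text, label) demos followed by the query."""
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

zero_shot = build_prompt("A delightful film.")
few_shot = build_prompt(
    "A delightful film.",
    demos=[("Loved every minute.", "positive"),
           ("A tedious mess.", "negative")],
)
assert zero_shot.startswith("Review: A delightful film.")
assert few_shot.count("Sentiment:") == 3  # 2 demonstrations + the query
```

The model's weights are identical in both settings; only the context it conditions on changes.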
Pile: The Pile—a large-scale, diverse, open-source dataset for language model training
MinHashLSH: MinHash Locality Sensitive Hashing—an algorithm used for detecting and removing near-duplicate documents in the dataset
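A simplified MinHash sketch for near-duplicate detection (assumptions: word shingles and Python's built-in `hash` as the hash family; a production system bands these signatures into LSH buckets to avoid all-pairs comparison):

```python
def shingles(text, k=3):
    """The set of k-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(sh, num_hashes=64):
    """Signature: for each seed, the minimum hash over the shingle set."""
    return [min(hash((seed, s)) for s in sh) for seed in range(num_hashes)]

def similarity(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash(shingles("the quick brown fox jumps over the lazy cat"))
c = minhash(shingles("an entirely different sentence about model training"))
assert similarity(a, b) > similarity(a, c)  # near-duplicates score higher
```

The key property is that the probability two signatures agree in a given slot equals the Jaccard similarity of the underlying shingle sets, so short fixed-size signatures stand in for full document comparison.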
Megatron-LM: A highly optimized library for training large transformer models on NVIDIA GPUs
AdamW: A variant of the Adam optimizer that decouples weight decay from the gradient update