RLHF: Reinforcement Learning from Human Feedback—a technique to fine-tune models using a reward signal derived from human preferences
SFT: Supervised Fine-Tuning—training a model on a dataset of high-quality human demonstrations (prompts and desired responses)
RM: Reward Model—a model trained to predict which of two outputs a human would prefer
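The reward model is typically trained with a pairwise (Bradley-Terry style) loss: the output the labeler preferred should receive a higher score than the rejected one. A minimal scalar sketch of that loss (the function name and toy scores are illustrative, not from any particular library):

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Trains the reward model to score the human-preferred ('chosen')
    output higher than the rejected one. Toy scalar version; real
    implementations operate on batches of model-scored pairs.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Loss is small when the chosen output already outscores the rejected
# one, and large when the ranking is inverted.
low = reward_model_loss(2.0, -1.0)
high = reward_model_loss(-1.0, 2.0)
```

Minimizing this loss over many labeled pairs pushes the scalar scores to reproduce the labelers' rankings.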
PPO: Proximal Policy Optimization—an RL algorithm that updates the policy to maximize reward while limiting how much the policy changes in one step
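The "limiting how much the policy changes" part of PPO comes from its clipped surrogate objective: the probability ratio between the new and old policies is clipped to a small interval. A per-sample sketch, assuming log-probabilities and an advantage estimate are already available (the function and argument names are illustrative):

```python
import math

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, eps: float = 0.2) -> float:
    """PPO's clipped surrogate objective for a single action.

    The ratio pi_new / pi_old is clipped to [1 - eps, 1 + eps], so a
    single update step cannot move the policy arbitrarily far.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the min makes the objective pessimistic: large ratio
    # changes stop contributing extra reward.
    return min(ratio * advantage, clipped * advantage)

# A big jump in probability (ratio ~ e) is capped at 1 + eps.
capped = ppo_clipped_objective(logp_new=-1.0, logp_old=-2.0, advantage=1.0)
# No policy change means the objective is just the advantage.
unchanged = ppo_clipped_objective(logp_new=-2.0, logp_old=-2.0, advantage=1.0)
```

In practice this is averaged over batches of sampled tokens, but the clipping logic per action is exactly this.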
PPO-ptx: A variant of PPO that mixes a pretraining loss term into the objective, reducing the performance regressions on public NLP tasks known as the alignment tax
Alignment tax: The cost in performance on specific public NLP tasks (like SQuAD or translation) that comes from aligning the model to human preferences
Hallucination: When a model generates information that is factually incorrect or not present in the source input
Prompt: The input text given to a language model to elicit a response
Labeler: A human contractor who writes demonstrations or ranks model outputs
Win rate: The percentage of time one model's output is preferred over another model's output by human judges
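Computing a win rate from head-to-head judgments is simple; a sketch, assuming each judgment records which model's output the judge picked:

```python
def win_rate(judgments: list[str]) -> float:
    """Fraction of pairwise comparisons won by model A.

    `judgments` is a list of picks, each "A" or "B", one per
    head-to-head comparison judged by a human.
    """
    return sum(1 for j in judgments if j == "A") / len(judgments)

rate = win_rate(["A", "A", "B", "A"])
```

Reported win rates are usually aggregated over many prompts and several judges per prompt.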
KL penalty: Kullback-Leibler divergence penalty—a per-token penalty subtracted from the reward to keep the RL policy from drifting too far from the initial supervised model
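The KL penalty enters the objective as reward shaping: the reward-model score minus a scaled per-token log-probability gap between the policy and the reference (supervised) model. A scalar sketch, with an illustrative coefficient name `beta`:

```python
def shaped_reward(rm_score: float, logp_policy: float,
                  logp_ref: float, beta: float = 0.02) -> float:
    """Per-token reward with a KL penalty term.

    Subtracts beta * (log pi(a|s) - log pi_ref(a|s)) from the reward
    model's score, so the RL policy is penalized for assigning tokens
    probabilities that differ from the initial supervised model's.
    """
    return rm_score - beta * (logp_policy - logp_ref)

# When the policy matches the reference, the penalty vanishes.
no_drift = shaped_reward(1.0, logp_policy=-2.0, logp_ref=-2.0)
# When the policy is more confident than the reference, reward drops.
drifted = shaped_reward(1.0, logp_policy=-1.0, logp_ref=-2.0, beta=0.1)
```

Summed over tokens, the penalty term is an estimate of the KL divergence between the policy and the reference model on the sampled response.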