RLHF: Reinforcement Learning from Human Feedback—aligning models using a reward model trained on human preferences
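The RLHF optimization step is commonly written as maximizing expected reward under a KL penalty, with reward model $r_\phi$, policy $\pi_\theta$, reference model $\pi_{\mathrm{ref}}$, and penalty weight $\beta$:

\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[r_\phi(x, y)\right] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)
\]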
RLAIF: Reinforcement Learning from AI Feedback—using an AI system instead of humans to generate preference labels for alignment
PPO: Proximal Policy Optimization—an on-policy RL algorithm used to optimize the LLM against the reward model
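For reference, PPO's clipped surrogate objective in the standard notation of Schulman et al. (2017), with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$, advantage estimate $\hat{A}_t$, and clip range $\epsilon$:

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
\]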
DPO: Direct Preference Optimization—an algorithm that optimizes the policy directly from preference data without training an explicit reward model
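The DPO loss is the Bradley-Terry negative log-likelihood applied to implicit rewards. A minimal PyTorch-style sketch (function and argument names are illustrative; it assumes per-response summed log-probabilities have already been computed for the chosen and rejected responses):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry negative log-likelihood on the reward margin
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```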
SFT: Supervised Fine-Tuning—the first stage of the alignment pipeline, in which the model learns to follow instructions from labeled demonstrations
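In its usual form, the SFT objective is ordinary next-token cross-entropy over instruction–response pairs $(x, y)$:

\[
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\sum_{t} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)\right]
\]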
Bradley-Terry model: A statistical model for estimating the probability that one item is preferred over another based on their latent scores
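With latent scores $s_i$ and $s_j$, the Bradley-Terry preference probability is

\[
P(i \succ j) = \frac{e^{s_i}}{e^{s_i} + e^{s_j}} = \sigma(s_i - s_j)
\]

where $\sigma$ is the logistic sigmoid; this is the form used to fit reward models on pairwise comparisons.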
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a reference distribution, used in RLHF as a penalty that keeps the aligned model from drifting too far from the reference model (typically the SFT model)
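For discrete distributions $P$ and $Q$:

\[
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
\]

In RLHF this appears as the penalty term $\beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ in the objective above.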
Implicit Reward Model: A reward function that is mathematically derived from the optimal policy itself (as in DPO), bypassing the need for a separate reward network
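Under the DPO derivation, the reward implied by a policy is

\[
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\]

where the partition function $Z(x)$ depends only on the prompt and cancels in pairwise comparisons, which is why no separate reward network is needed.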
Pointwise Reward: A single scalar score assigned to a specific prompt-response pair
Listwise Feedback: Feedback where a labeler ranks a list of K responses rather than just comparing a pair
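Listwise preferences are commonly modeled with the Plackett-Luce extension of Bradley-Terry; for a ranking $y_{\tau(1)} \succ \cdots \succ y_{\tau(K)}$ over responses with scores $s_k$:

\[
P\!\left(y_{\tau(1)} \succ \cdots \succ y_{\tau(K)}\right) = \prod_{k=1}^{K} \frac{e^{s_{\tau(k)}}}{\sum_{j=k}^{K} e^{s_{\tau(j)}}}
\]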
Off-policy RL: Learning from data generated by a previous version of the policy (or a different policy entirely), rather than by the current policy being trained
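The standard correction for off-policy data is importance sampling: an expectation under the current policy $\pi_\theta$ is rewritten as an expectation under the behavior policy $\mu$ that generated the data,

\[
\mathbb{E}_{y \sim \pi_\theta}\!\left[f(y)\right] = \mathbb{E}_{y \sim \mu}\!\left[\frac{\pi_\theta(y)}{\mu(y)}\, f(y)\right]
\]

which is the same probability ratio that PPO's clipped objective above keeps close to 1.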