All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF), Preference Fine-Tuning (PFT)

Theoretical analysis suggests online and offline fine-tuning should be equivalent, but empirical gains from online RL likely stem from reducing the policy search space to candidates optimal for simple verifiers.

Core Problem

Despite theoretical arguments suggesting offline maximum likelihood estimation (MLE) should suffice for preference fine-tuning, complex two-stage online RL procedures consistently outperform offline methods in practice.

Why it matters:

Understanding why online RL works is crucial for designing more efficient fine-tuning algorithms
The standard two-stage pipeline (Reward Model + PPO) is computationally expensive compared to offline alternatives like DPO
Current justifications for online RL (like recovering from mistakes) do not straightforwardly apply to language modeling where tokens cannot be deleted

Concrete Example: In summarization tasks, offline methods like DPO often fail to match the quality of online methods (like online DPO), even when both use the same underlying data and model architectures, suggesting a hidden benefit to the generation-verification loop.

Key Novelty

The Generation-Verification Gap Hypothesis for RLHF

Proves theoretically that under ideal conditions (isomorphic reward/policy classes), online and offline fine-tuning optimize the same objective and should yield identical results
Hypothesizes that the empirical advantage of online RL comes from 'proper learning': it restricts the search to policies that are optimal for simple reward models (verifiers), effectively regularizing the solution space
Demonstrates that simply filtering data or training a reward model on the same data as the policy extracts more value than direct policy optimization

Architecture

A conceptual diagram illustrating the two-stage online fine-tuning process as a projection.

Evaluation Highlights

Online DPO outperforms standard offline DPO significantly on the TL;DR summarization task (winrate vs human reference)
Online DPO (using samples from an offline DPO model) outperforms the offline DPO model itself, despite using no new human data
Theoretical proof that optimal policies for online and offline PFT are identical when reward and policy classes are isomorphic

Breakthrough Assessment

8/10

Provides a strong theoretical unification of online and offline methods while offering a novel, empirically supported hypothesis (the generation-verification gap) to explain the persistent gap observed in practice.

⚙️ Technical Details

Problem Definition

Setting: Finite-horizon, reward-free Markov Decision Process (MDP) for language generation

Inputs: Prompt x (initial state)

Outputs: Completion y (trajectory of tokens)

Pipeline Flow

Reward Learning (offline stage): Train a Reward Model (RM) on preference data
Online RL (online stage): Optimize the policy using RM feedback on the policy's own generated samples

System Modules

Reward Model

Assign scalar scores to prompt-completion pairs to guide the policy

Model or implementation: Pythia-based transformer with scalar head
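
A reward model with a scalar head is commonly trained on preference pairs with a Bradley-Terry style logistic loss. The summary does not state which loss the paper uses for its reward model, so the sketch below is an assumption that shows only the general idea: the loss shrinks as the preferred completion's score rises above the dispreferred one's. The function names are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(score_preferred: float, score_dispreferred: float) -> float:
    """Negative log-likelihood that the preferred completion wins.

    Assumption: the reward model emits one scalar per (prompt, completion)
    pair and preferences follow a Bradley-Terry model on those scalars.
    """
    return -log_sigmoid(score_preferred - score_dispreferred)


if __name__ == "__main__":
    # Equal scores give log(2); a larger margin gives a smaller loss.
    print(bradley_terry_loss(0.0, 0.0))
    print(bradley_terry_loss(2.0, 0.0))
```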

Policy

Generate completions for prompts

Model or implementation: Pythia series (e.g., 2.8B)

Novel Architectural Elements

Theoretical framework treating Reward Models and Policies as isomorphic function classes to prove equivalence of objectives
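
One familiar way to see how a reward function and a policy can be put in correspondence is the closed-form solution of the KL-regularized objective that underlies DPO. The derivation below is the standard one and is offered only as an illustration; the paper's own construction of isomorphic classes may differ in its details. Here $Z(x)$ is the per-prompt normalizer and $\pi_{\mathrm{ref}}$ is the reference policy.

```latex
% KL-regularized objective for a fixed reward r
\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its maximizer
\pi_r(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \, \exp\!\big( r(x, y) / \beta \big)

% Inverting: every such policy induces a reward, up to a prompt-only shift
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```

Because the $\beta \log Z(x)$ term cancels in any comparison between two completions of the same prompt, a preference loss written in terms of $r$ can be rewritten purely in terms of $\pi$.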

Modeling

Base Model: Pythia series (specifically Pythia-2.8B for reported experiments)

Training Method: Online DPO (iterative generation and preference optimization)

Objective Functions:

Purpose: Maximize the likelihood of preferred data over dispreferred data.

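Formally, for a prompt x with preferred completion y_w and dispreferred completion y_l: $\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\left( \beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)$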
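
The DPO objective stated above can be sketched for a single preference pair as follows. The inputs are sequence-level log-probabilities of the preferred and dispreferred completions under the policy and the reference model. The default `beta=0.1` is purely an illustrative assumption, since the summary notes that beta is not reported, and the function name is hypothetical.

```python
import math


def dpo_loss(
    logp_w: float,      # log pi(y_w | x) under the policy
    logp_l: float,      # log pi(y_l | x) under the policy
    ref_logp_w: float,  # log pi_ref(y_w | x) under the reference model
    ref_logp_l: float,  # log pi_ref(y_l | x) under the reference model
    beta: float = 0.1,  # assumed value; not reported in the paper
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (log-ratio_w - log-ratio_l))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed in a stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy raises y_w and lowers y_l relative to the reference: lower loss.
    print(dpo_loss(-5.0, -10.0, -8.0, -8.0, beta=1.0))
```

When the policy equals the reference, both log-ratios vanish and the loss is log 2; the loss falls as the policy shifts probability mass toward the preferred completion relative to the reference.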

Adaptation: Full fine-tuning

Training Data:

Reddit TL;DR summarization dataset

Key Hyperparameters:

beta: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
sampling_count: 25 completions per prompt (for Online DPO)
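
To make the online step concrete, here is a minimal sketch of how a preference pair might be built from sampled completions: draw several completions per prompt, score each with the reward model, and pair the highest-scoring with the lowest-scoring. The best-versus-worst pairing rule is an assumption, as the summary only reports the sample count of 25. `sample_fn` and `reward_fn` are hypothetical stand-ins for the policy's sampler and the reward model.

```python
import random
from typing import Callable, Tuple


def build_preference_pair(
    prompt: str,
    sample_fn: Callable[[str], str],         # hypothetical: draws one completion
    reward_fn: Callable[[str, str], float],  # hypothetical: RM score for (prompt, completion)
    num_samples: int = 25,                   # sample count reported for Online DPO
) -> Tuple[str, str]:
    """Return (chosen, rejected) completions for one prompt.

    Assumption: the pair is the highest- and lowest-scoring sample under
    the reward model; other pairing rules are possible.
    """
    completions = [sample_fn(prompt) for _ in range(num_samples)]
    ranked = sorted(completions, key=lambda y: reward_fn(prompt, y))
    return ranked[-1], ranked[0]


if __name__ == "__main__":
    rng = random.Random(0)

    # Toy sampler and toy reward (shorter is better), for illustration only.
    def toy_sampler(prompt: str) -> str:
        return "word " * rng.randint(1, 10)

    def toy_reward(prompt: str, completion: str) -> float:
        return -float(len(completion))

    chosen, rejected = build_preference_pair("Summarize: ...", toy_sampler, toy_reward)
    print(len(chosen) <= len(rejected))
```

The resulting pairs can then be fed to the same DPO loss used offline, which is what keeps the loss function fixed between the two conditions.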

Compute: Not reported in the paper

Comparison to Prior Work

vs. DPO: Shows Online DPO outperforms Offline DPO even when controlling for loss function and data, attributed to search space reduction
vs. PPO: Argues the benefit of PPO/Online methods is not 'learning from mistakes' but restricting policy search to simple verifiers

Limitations

Theoretical equivalence relies on idealized assumptions (exact optimization, isomorphic classes) that rarely hold perfectly in practice
Experiments focus on a single task (summarization) and model family (Pythia)
Does not strictly rule out other hypotheses (regularization, exploration) but argues they are insufficient explanations alone

Reproducibility

Code URL not provided in the paper. Experiments use standard datasets (TL;DR) and models (Pythia), but specific hyperparameters for the controlled DPO vs Online DPO comparison are not detailed beyond the sampling strategy.

📊 Experiments & Results

Evaluation Setup

Summarization of Reddit posts

Benchmarks:

TL;DR Summarization (Text Summarization)

Metrics:

Winrate vs. human-generated reference summaries (evaluated by GPT-4o)
Statistical methodology: Not explicitly reported in the paper

Key Results

Controlled experiments comparing Offline DPO and Online DPO on the TL;DR summarization task using Pythia-2.8B.
Benchmark	Metric	Baseline	This Paper	Δ
TL;DR Summarization	Winrate vs Human	0.52	0.58	+0.06
TL;DR Summarization	Winrate vs Human	0.52	0.61	+0.09
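
For readers who want to check the arithmetic, the sketch below shows how a win rate and the Δ column can be computed. The judgment list is made-up illustrative data, not results from the paper; only the figures 0.52, 0.58, and 0.61 come from the table above. Function names are hypothetical.

```python
from typing import Sequence


def win_rate(judgments: Sequence[bool]) -> float:
    """Fraction of comparisons in which the model summary beat the reference."""
    return sum(judgments) / len(judgments)


def delta(candidate: float, baseline: float) -> float:
    """Absolute win-rate improvement, rounded to two decimals as in the table."""
    return round(candidate - baseline, 2)


if __name__ == "__main__":
    # Illustrative judgments only (True = judge preferred the model summary).
    print(win_rate([True, False, True, True]))  # 0.75

    # Reproduce the delta column from the reported win rates.
    print(delta(0.58, 0.52))  # 0.06
    print(delta(0.61, 0.52))  # 0.09
```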

Experiment Figures

Bar chart comparing win rates of various fine-tuning methods against human references on the TL;DR dataset.

Main Takeaways

Online DPO consistently outperforms Offline DPO, even though the two are theoretically equivalent under idealized assumptions; this indicates that those assumptions do not hold in practice.
The performance gap exists even when controlling for the loss function (using DPO loss for both) and data source.
Iterative training (Online DPO starting from an Offline DPO model) yields the best performance, suggesting that the Reward Model contains 'more juice to squeeze' than the policy can extract via offline MLE alone.
The results support the hypothesis that the value of RL lies in 'proper learning'—finding policies optimal for simple verifiers—rather than just better regularization or exploration.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Maximum Likelihood Estimation (MLE)
Information Geometry (KL divergence projections)

Key Terms

RLHF: Reinforcement Learning from Human Feedback—a method to align models using a reward model trained on human preferences

PFT: Preference Fine-Tuning—fine-tuning models to generate outputs preferred by human or automated raters

DPO: Direct Preference Optimization—an offline method optimizing policy to satisfy preferences without an explicit reward model loop

PPO: Proximal Policy Optimization—an online RL algorithm often used in the second stage of RLHF

MLE: Maximum Likelihood Estimation—standard supervised learning objective maximizing the probability of data

Generation-Verification Gap: The concept that it is often computationally easier to verify a good solution (reward model) than to generate one (policy)

Proper Learning: Learning a hypothesis from a restricted class (e.g., policies optimal for some reward model) rather than any arbitrary function

Isomorphic Classes: When the set of functions representable by the policy class is mathematically equivalent to the set representable by the reward model class