A Long Way to Go: Investigating Length Correlations in RLHF

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF), Alignment, Reward Hacking

The paper shows that the improvements from standard PPO-based RLHF are largely driven by longer outputs: a length-only reward heuristic reproduces most of the gains obtained with complex learned reward models.

Core Problem

RLHF (specifically PPO) consistently drives models to generate longer outputs, raising the question of whether reported improvements represent genuine quality gains or merely optimization for length.

Why it matters:

Current reward models may be misaligned, optimizing for shallow correlations (length) rather than true human preference
Widely reported 'progress' in LLM alignment using metrics like AlpacaFarm win-rates may be illusory if simple heuristics can match them
Over-optimization leads to verbosity without necessarily increasing helpfulness or reducing harm

Concrete Example: On the WebGPT dataset, PPO significantly increases output length compared to the base model. When restricted to outputs of similar length, the reward improvement from PPO drops to near zero, suggesting the 'improvement' is almost entirely due to writing more words.

Key Novelty

Length-Only PPO (lppo) Diagnostic Baseline

Proposes a diagnostic baseline where PPO is trained using ONLY output length as the reward signal (ignoring the actual prompt content entirely), constrained by KL divergence.
Uses 'Non-Length Reward Gain' (NRG) analysis to decompose reward improvements into gains from length shifts versus gains from actual content quality within length buckets.
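The length-bucket decomposition described above can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the function name `non_length_reward_gain`, the `bucket_width` parameter, and the choice to weight each bucket by its number of PPO outputs are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def non_length_reward_gain(sft, ppo, bucket_width=20):
    """Compare rewards of SFT and PPO outputs within shared length buckets.

    sft, ppo: lists of (length, reward) pairs (hypothetical input format).
    Returns (overall_gain, nrg), where overall_gain is the raw difference in
    mean reward and nrg is the weighted mean of per-bucket differences.
    """
    def bucketize(samples):
        buckets = defaultdict(list)
        for length, reward in samples:
            buckets[length // bucket_width].append(reward)
        return buckets

    sft_buckets, ppo_buckets = bucketize(sft), bucketize(ppo)
    shared = sorted(set(sft_buckets) & set(ppo_buckets))
    total_weight = sum(len(ppo_buckets[b]) for b in shared)
    if total_weight == 0:
        raise ValueError("SFT and PPO outputs share no length bucket")

    # Assumption: each bucket is weighted by its number of PPO outputs.
    nrg = sum(
        len(ppo_buckets[b]) * (mean(ppo_buckets[b]) - mean(sft_buckets[b]))
        for b in shared
    ) / total_weight
    overall_gain = mean([r for _, r in ppo]) - mean([r for _, r in sft])
    return overall_gain, nrg


# Toy data in which reward depends only on the length bucket.
sft_samples = [(10, 0.1), (15, 0.1), (30, 0.3)]
ppo_samples = [(12, 0.1), (32, 0.3), (35, 0.3)]
overall, nrg = non_length_reward_gain(sft_samples, ppo_samples)
```

In this toy data the PPO outputs score higher on average only because more of them fall into the longer bucket, so `overall` is positive while `nrg` is zero.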

Architecture

Diagram of the RLHF pipeline highlighting intervention points: Preference Data, Reward Model, PPO Policy Optimization (Rollout, KL Loss, Reward Score).

Evaluation Highlights

Length-only PPO (lppo) achieves a 56% win-rate on WebGPT, nearly matching the 58% win-rate of standard PPO with a learned reward model.
On RLCD, lppo slightly outperforms standard PPO (64% vs 63% win-rate), suggesting that optimizing length alone is enough to match a learned reward model on current metrics.
For WebGPT, 98% of the reward gain from standard PPO is attributable to length shifts, with only 2% coming from non-length features (NRG).

Breakthrough Assessment

8/10

A critical diagnostic paper that exposes a fundamental flaw in current RLHF practices and evaluation. While it doesn't propose a new SOTA method, it significantly challenges the validity of existing 'SOTA' claims in alignment.

⚙️ Technical Details

Problem Definition

Setting: Aligning Large Language Models (LLMs) to human preferences using Reinforcement Learning

Inputs: Prompt x

Outputs: Generated text y

Pipeline Flow

SFT Model (Initialization)
PPO Training Loop (Policy Optimization)
Reward Generation (Learned Model vs. Length Heuristic)

System Modules

Policy Model

Generates responses to prompts; updated via PPO

Model or implementation: Llama-7B

Reward Function

Provides scalar feedback for RL

Model or implementation: Learned Reward Model OR Length Heuristic

Novel Architectural Elements

Substitution of the complex learned reward model with a length-based heuristic function (R*(y) = 1 - |len(y)/L - 1|) within the standard PPO pipeline to test length bias.
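The heuristic R*(y) = 1 - |len(y)/L - 1| can be written directly as a small function. This is a minimal sketch that assumes len(y) is counted in tokens; the function and parameter names are illustrative rather than taken from the paper's code.

```python
def length_reward(num_tokens: int, target_length: int) -> float:
    """Length-only reward R*(y) = 1 - |len(y)/L - 1|.

    The reward peaks at 1.0 when the output is exactly target_length long
    and falls off linearly as the output gets shorter or longer.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return 1.0 - abs(num_tokens / target_length - 1.0)


# Using the WebGPT target length L = 156 listed under Key Hyperparameters.
print(length_reward(156, 156))  # 1.0 (exactly on target)
print(length_reward(78, 156))   # 0.5 (half the target length)
print(length_reward(312, 156))  # 0.0 (twice the target length)
```

Note that the reward never looks at the prompt or the content of the response, which is what makes it useful as a diagnostic for length bias.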

Modeling

Base Model: Llama-7B

Training Method: Proximal Policy Optimization (PPO)

Objective Functions:

Purpose: Maximize reward while staying close to reference model.

Formally: E[R(x,y) - lambda * log(pi_RL(y|x)/pi_SFT(y|x))]
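As a rough illustration of this objective, the sketch below computes a Monte Carlo estimate from sampled responses. It assumes sequence-level log-probabilities are already available for both policies; real PPO implementations often apply the KL penalty per token instead. The function names and the input format are hypothetical.

```python
def kl_penalized_reward(reward, logp_rl, logp_sft, kl_coef=0.04):
    """Return R(x, y) - lambda * log(pi_RL(y|x) / pi_SFT(y|x)) for one sample.

    logp_rl and logp_sft are log-probabilities of the same response y under
    the current policy and the SFT reference policy.
    """
    log_ratio = logp_rl - logp_sft
    return reward - kl_coef * log_ratio


def estimate_objective(samples, kl_coef=0.04):
    """Average the penalized reward over (reward, logp_rl, logp_sft) tuples."""
    if not samples:
        raise ValueError("need at least one sample")
    total = sum(
        kl_penalized_reward(r, lp_rl, lp_sft, kl_coef)
        for r, lp_rl, lp_sft in samples
    )
    return total / len(samples)


# Two made-up samples: the first has drifted from the reference, the second has not.
samples = [(1.0, -10.0, -12.0), (0.5, -8.0, -8.0)]
standard = estimate_objective(samples, kl_coef=0.04)
high_penalty = estimate_objective(samples, kl_coef=0.12)
```

With the larger coefficient, responses that the policy makes much more likely than the reference does are penalized more heavily, which is the mechanism behind the high-KL intervention listed under Key Hyperparameters.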

Adaptation: LoRA (rank=16)

Key Hyperparameters:

kl_coefficient_lambda: 0.04 (standard), 0.12 (high penalty)
batch_size: 64
lora_rank: 16
target_length_L: 156 (WebGPT), 120 (RLCD), 250 (Stack)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard PPO: lppo uses a fixed length heuristic instead of a learned reward model, achieving similar downstream win-rates
vs. sft-long: lppo uses RL optimization with KL constraints rather than rejection sampling, allowing it to learn features (like avoiding repetition) that support longer outputs

Limitations

Evaluation relies heavily on AlpacaFarm simulated preferences, which may themselves have length biases
Experiments limited to Llama-7B; larger models might exhibit different behaviors
Did not test on 'harmlessness' or 'safety' datasets, only 'helpfulness' (WebGPT, Stack, RLCD)
High KL penalty experiments impeded convergence, limiting the range of feasible interventions

Reproducibility

Code URL not provided in text. Uses public libraries (HuggingFace TRL) and public SFT models (AlpacaFarm SFT, TRL SFT). Specific scripts for lppo and interventions are not explicitly linked.

📊 Experiments & Results

Evaluation Setup

RLHF fine-tuning on three helpfulness datasets

Benchmarks:

WebGPT (Long-form Question Answering)
Stack (Technical Question Answering, StackExchange)
RLCD (Multi-turn conversation, Helpfulness)

Metrics:

Win-rate vs SFT (AlpacaFarm simulated preferences)
Reward Score
Non-Length Reward Gain (NRG)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative analysis of standard PPO against the proposed 'Length-Only PPO' (lppo) baseline on downstream win-rates. The small deltas indicate that optimizing for length alone accounts for most of the performance gain.
Benchmark	Metric	Standard PPO (%)	lppo (%)	Δ
WebGPT	Win-rate vs SFT	58	56	-2
RLCD	Win-rate vs SFT	63	64	+1

Reward decomposition analysis showing how much of the reward increase is strictly due to length shifts versus actual quality improvements within length buckets.
Benchmark	Metric	Baseline	This Paper	Δ
WebGPT	Non-Length Reward Gain (NRG) %	100	2	-98

Experiment Figures

Scatter plot showing correlation between Reward Score and Output Length for WebGPT.

Histograms of output lengths for SFT vs. PPO models across three datasets.

Length-stratified reward analysis comparing SFT and PPO (High KL).
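The relationship shown in the reward-versus-length scatter plot can be quantified with a correlation coefficient. The sketch below computes a plain Pearson correlation from paired lengths and reward scores; it is a generic illustration with hypothetical names, and the source does not state which correlation statistic the paper reports.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up output lengths and reward scores, for illustration only.
lengths = [80, 120, 160, 200, 240]
rewards = [0.10, 0.35, 0.40, 0.70, 0.85]
r = pearson_correlation(lengths, rewards)
```

A value of `r` close to 1 would indicate that the reward model's score rises almost linearly with output length.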

Main Takeaways

Optimizing for length alone (lppo) reproduces most downstream RLHF improvements found in standard PPO.
On WebGPT and RLCD, 70–90% of the reward improvement can be explained purely by length shifts, leaving comparatively small 'non-length' reward gains.
Learned reward models are non-robust and easily influenced by length biases in preference data, failing to capture deeper aspects of human preference.
Simple anti-length interventions (like high KL penalty) do not consistently solve the problem and can impede convergence.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO)
Bradley-Terry preference model
Kullback-Leibler (KL) divergence

Key Terms

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using a reward model trained on human preferences

PPO: Proximal Policy Optimization—an RL algorithm used to update the language model policy to maximize reward while limiting deviation from the initial policy

lppo: Length-only PPO—a proposed baseline where the reward function is simply a heuristic based on output length, ignoring content quality

NRG: Non-Length Reward Gain—a metric calculating reward improvement within specific length buckets to isolate quality gains from length-shift gains

SFT: Supervised Fine-Tuning—the initial phase of training where the model learns to mimic reference outputs before RLHF

AlpacaFarm: A simulation framework using LLM-based annotators to estimate human preference win-rates between model outputs

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique

KL divergence: A measure of how much one probability distribution differs from another, used in PPO as a penalty to keep the policy from drifting too far from the reference (SFT) model