On The Fragility of Learned Reward Functions

📝 Paper Summary

Preference-based reward learning
Reinforcement Learning from Human Feedback (RLHF)

Reward functions learned from preferences often fail to train new agents from scratch despite working for the original agent, a failure mode exacerbated by high-performing data collectors.

Core Problem

Learned reward functions frequently overfit to the specific trajectory distribution of the agent used to collect data, making them fragile when used to train new agents (relearning).

Why it matters:

Practitioners assume a learned reward captures the task intent and can be reused to train better policies or transfer to new architectures
Evaluating only the 'sampler' agent (trained alongside the reward) masks severe robustness issues, creating a false sense of security in RLHF systems

Concrete Example: In HalfCheetah, a reward model trained with 8 million steps of interaction produces a high-performing 'sampler' agent. When the same reward model is frozen and used to train a fresh 'relearner' agent, training fails catastrophically and yields near-zero returns, because the reward model has overfit to the sampler's narrow, high-reward distribution.

Key Novelty

Systematic Evaluation of Relearning Failures

Formalize 'relearning evaluation' as a stress test: freezing a learned reward and training a randomly initialized agent from scratch to detect overfitting
Identify an 'anti-correlation' phenomenon where training the data-collecting agent *longer* (better performance) actually degrades the learned reward's ability to train new agents
Demonstrate that reward ensembles can mitigate these failures in tabular settings by smoothing out 'reward delusions' (spurious high rewards) in off-distribution regions

Architecture

Implicit workflow: Sampler generates data -> Reward Model trains -> Sampler updates. Then a separate Relearner trains from scratch on the frozen reward model.
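The second stage of this workflow can be illustrated with a deliberately tiny example. The sketch below is not the paper's setup: it assumes a one-step, four-armed problem, and the reward tables, `train_agent`, and `relearning_evaluation` are hypothetical names invented for illustration. The learned reward is accurate on the arms the sampler visited and spuriously high on an arm it never sampled, which is the kind of off-distribution error the relearning check is meant to expose.

```python
import random

# Hypothetical toy problem: 4 arms, one step per episode.
TRUE_REWARD = [0.0, 0.5, 1.0, -1.0]     # ground truth, hidden from the agents
LEARNED_REWARD = [0.1, 0.6, 0.9, 2.0]   # arm 3 was never sampled: spurious high value


def train_agent(reward_table, rng):
    """Stand-in for RL training: try arms in random order, keep the best one."""
    arms = list(range(len(reward_table)))
    rng.shuffle(arms)  # each relearner starts from a different random order
    best = arms[0]
    for arm in arms[1:]:
        if reward_table[arm] > reward_table[best]:
            best = arm
    return best


def relearning_evaluation(reward_table, n_relearners=5, seed=0):
    """Train fresh agents on a frozen reward table; score them with ground truth."""
    returns = []
    for i in range(n_relearners):
        rng = random.Random(seed + i)
        policy = train_agent(reward_table, rng)  # reward_table is never updated
        returns.append(TRUE_REWARD[policy])
    return returns


relearner_returns = relearning_evaluation(LEARNED_REWARD)
oracle_returns = relearning_evaluation(TRUE_REWARD)
print("relearners on learned reward:", relearner_returns)
print("relearners on true reward:   ", oracle_returns)
```

In this toy case every relearner is drawn to the unvisited arm, so its ground-truth return is poor even though a sampler that stayed on arm 2 would look fine under standard evaluation.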

Evaluation Highlights

In HalfCheetah, increasing the data collector's training budget from 1M to 8M steps causes relearner performance to drop from ~4000 to <0 return
In the 'Stay Inside' tabular task, using a 5-member reward ensemble ensures 100% of relearners match sampler performance, whereas single rewards yield high failure rates

Breakthrough Assessment

7/10

Important empirical study revealing a critical flaw in standard RLHF evaluation protocols. While it suggests ensembles as a fix, the primary contribution is diagnosing the 'fragility' phenomenon.

⚙️ Technical Details

Problem Definition

Setting: Preference-based Reward Learning (learning R from pairwise comparisons of trajectory segments)

Inputs: Pairs of trajectory segments (σ1, σ2) and binary preferences y indicating which segment is better

Outputs: A learned reward function r̂_ψ that explains the preferences

Pipeline Flow

Trajectory Collection (Sampler)
Preference Elicitation (Synthetic)
Reward Inference (Bradley-Terry)
Policy Optimization (Sampler)
Relearning Evaluation (New Agent)

System Modules

Sampler Agent

Interact with environment to generate trajectory segments for preference labeling

Model or implementation: SAC Policy (Continuous) or Soft Q-Learning (Tabular)

Reward Model

Predict scalar rewards for state-action pairs from preference labels (synthetic in this paper's experiments)

Model or implementation: MLP (2 layers, 256 units)

Relearner Agent

Test the robustness of the frozen learned reward by learning from scratch

Model or implementation: Same architecture as Sampler (initialized randomly)

Novel Architectural Elements

Relearning Evaluation Protocol: Explicitly separating the 'Sampler' (data collector) from the 'Relearner' to test reward function generalization

Modeling

Base Model: Custom MLP for Reward Network

Training Method: Deep Reinforcement Learning from Human Preferences (DRLHP)

Objective Functions:

Purpose: Optimize reward model to match preferences.

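Formally: Minimize the negative log-likelihood of the preferences under the Bradley-Terry model: P(σ1 ≻ σ2) = exp(Σ r̂_ψ(σ1)) / (exp(Σ r̂_ψ(σ1)) + exp(Σ r̂_ψ(σ2))), where Σ r̂_ψ(σ) is the sum of predicted rewards over the steps of segment σ

Purpose: Train policy to maximize learned reward.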

Formally: Standard RL objectives (Soft Q-Learning or SAC) using r̂_ψ
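As a rough illustration of the reward-model objective, the sketch below computes the Bradley-Terry preference probability and the mean negative log-likelihood for a batch of labeled pairs. It is a minimal stdlib-only sketch rather than the paper's implementation: the function names are hypothetical, each segment is represented simply as a list of predicted per-step rewards, and the label convention (1 means the first segment was preferred) is an assumption.

```python
import math


def segment_return(rewards):
    """Sum of predicted per-step rewards over one trajectory segment."""
    return sum(rewards)


def bt_preference_prob(rewards_1, rewards_2):
    """P(segment 1 preferred over segment 2) under the Bradley-Terry model."""
    diff = segment_return(rewards_1) - segment_return(rewards_2)
    # exp(a) / (exp(a) + exp(b)) equals sigmoid(a - b); branch to avoid overflow.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def bt_nll(pairs, labels):
    """Mean negative log-likelihood of the labels.

    Assumed convention: labels[i] is 1 if the first segment of pairs[i]
    was preferred, and 0 if the second was.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for (r1, r2), y in zip(pairs, labels):
        p = bt_preference_prob(r1, r2)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(pairs)


pairs = [([1.0, 1.0], [0.0, 0.5]), ([0.2, 0.1], [0.9, 0.8])]
labels = [1, 0]
print(f"NLL = {bt_nll(pairs, labels):.4f}")
```

In a real pipeline the per-step rewards would come from the reward network r̂_ψ and this loss would be minimized by gradient descent; only the loss computation is shown here.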

Key Hyperparameters:

reward_network_hidden_layers: [256, 256]
learning_rate_sac: 0.0003
batch_size_sac: 256
segment_length: 50
total_comparisons: 2000
optimizer: Adam

Compute: Not reported in the paper

Comparison to Prior Work

vs. PEBBLE: This work focuses on diagnostic evaluation of the learned reward via retraining, rather than just sampler performance
vs. Ibarz et al. (Atari): Expands their brief 'relearning' analysis into a systematic study of RL budget and ensemble effects
vs. EPIC [not cited in paper]: Uses EPIC distance as a metric but focuses on policy performance (relearning) rather than just reward distance

Limitations

Experiments limited to simple environments (HalfCheetah, Tabular Gridworld)
Uses synthetic ground-truth rewards for preferences rather than real human labelers
Why ensembles fail to help in continuous control remains unclear; the paper mentions this in its limitations but does not fully explore it
No statistical significance tests reported for the anti-correlation trends

Reproducibility

The paper does not provide code; the experiments rely on the standard libraries 'Imitation' and 'Stable-Baselines3'. The synthetic-preference experiments use a standard MuJoCo environment (HalfCheetah). Precise hyperparameters for reproduction are listed in Appendix A.

📊 Experiments & Results

Evaluation Setup

Synthetic preference learning where ground truth reward generates labels
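One simple way such synthetic labels can be generated is to compare the ground-truth returns of the two segments. The sketch below assumes a noiseless labeler that breaks ties in favor of the second segment; the summary does not say whether the paper adds label noise or handles ties differently, and the function names are hypothetical.

```python
def synthetic_preference(true_rewards_1, true_rewards_2):
    """Return 1 if segment 1 has the higher ground-truth return, else 0.

    Assumption: noiseless labels, ties go to segment 2.
    """
    return 1 if sum(true_rewards_1) > sum(true_rewards_2) else 0


def label_pairs(pairs):
    """Label a batch of (segment_1, segment_2) ground-truth reward sequences."""
    return [synthetic_preference(seg_1, seg_2) for seg_1, seg_2 in pairs]


example_pairs = [
    ([1.0, 1.0], [0.5, 0.5]),  # segment 1 has the higher return
    ([0.0, 0.0], [2.0, 0.0]),  # segment 2 has the higher return
]
example_labels = label_pairs(example_pairs)
print(example_labels)
```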

Benchmarks:

HalfCheetah (seals version) (Continuous Control Locomotion)
Stay Inside (Tabular Gridworld Navigation) [New]
Tiny Room (Tabular Gridworld Navigation) [New]

Metrics:

Ground Truth Return (of Sampler)
Ground Truth Return (of Relearner)
EPIC Distance (Reward similarity metric)

Statistical methodology: Shaded regions in plots represent 90% confidence intervals over 50 relearning runs (10 reward runs × 5 relearners)
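The summary does not state how the 90% intervals were computed. As one plausible approach, the sketch below uses a percentile bootstrap over the 50 per-run returns; the bootstrap itself, the function name `bootstrap_ci`, and the example data are all assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(values, level=0.90, n_resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 - tail) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Hypothetical returns: 10 reward runs x 5 relearners = 50 values.
data_rng = random.Random(1)
returns = [data_rng.gauss(1000.0, 300.0) for _ in range(50)]
low, high = bootstrap_ci(returns)
print(f"mean={mean(returns):.1f}, 90% CI=({low:.1f}, {high:.1f})")
```

Note that the 5 relearners sharing a reward run are not independent of each other, so a resampling scheme that respects that grouping may be more appropriate than the flat resampling shown here.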

Key Results

Experiments in HalfCheetah demonstrate the 'anti-correlation' between sampler training duration (RL Budget) and the quality of the learned reward for relearning.

Benchmark	Metric	Baseline	This Paper	Δ
HalfCheetah	Ground Truth Return	12000	-500	-12500
HalfCheetah	Ground Truth Return	4000	-500	-4500

Experiments in the tabular 'Stay Inside' environment show how reward ensembles can mitigate relearning failures.

Benchmark	Metric	Baseline	This Paper	Δ
Stay Inside (Tabular)	Relearner Return	-500	1500	+2000

Experiment Figures

Learning curves for Sampler vs. Relearner in HalfCheetah across different RL budgets (0.5M to 8M steps).

Effect of Reward Ensembles in 'Stay Inside' tabular environment.

Main Takeaways

Higher performing samplers (trained longer) produce reward functions that are WORSE for training new agents, likely due to dataset concentration in high-reward regions
Relearning failures are caused by 'reward delusions'—spurious high rewards in off-distribution parts of the state space that trap new agents
Reward ensembles effectively mitigate relearning failures in tabular settings by reducing the variance/value of these off-distribution reward delusions
Standard evaluation (checking sampler performance) is insufficient; relearning evaluations are necessary to verify reward robustness
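The ensemble takeaway can be illustrated with a small tabular sketch. It shows only the aggregation step (taking the mean of the member predictions), not the bootstrapped training of the members, and the reward tables and helper names are hypothetical. One member carries a spurious high reward on an off-distribution state; averaging with members that do not share the error pulls that value back down.

```python
from statistics import mean

# Hypothetical 5-member ensemble over 3 tabular states.
# States 0 and 1 are in-distribution; state 2 is off-distribution.
MEMBERS = [
    [0.0, 1.0, 3.0],    # member 0 has a spurious high reward at state 2
    [0.0, 0.75, -1.0],
    [0.0, 1.25, -0.5],
    [0.0, 1.0, -1.5],
    [0.0, 1.0, -1.0],
]
N_STATES = 3


def ensemble_reward(members, state):
    """Mean of the member predictions for one state."""
    return mean(member[state] for member in members)


def greedy_state(reward_of):
    """State a reward-maximizing agent would seek out under `reward_of`."""
    return max(range(N_STATES), key=reward_of)


single_choice = greedy_state(lambda s: MEMBERS[0][s])
ensemble_choice = greedy_state(lambda s: ensemble_reward(MEMBERS, s))
print("single reward model prefers state", single_choice)
print("ensemble mean prefers state", ensemble_choice)
```

The averaging only helps when the members disagree off-distribution; if all members share the same spurious value, the mean preserves it.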

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL)
Bradley-Terry Model
Reward Learning from Human Preferences

Key Terms

Sampler: The policy/agent trained alongside the reward model during the reward learning process to generate trajectory segments for labeling

Relearner: A new, randomly initialized policy trained from scratch using the *frozen* learned reward function to test its robustness

Bradley-Terry model: A probabilistic model predicting the preference between two items based on the difference of their underlying rewards

Reward Hacking: When an RL agent exploits errors or loopholes in a misspecified reward function to get high rewards without performing the intended task

Soft Actor-Critic (SAC): An off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework

RL Budget: The total number of environment interactions (timesteps) the sampler agent is allowed during the reward learning phase

EPIC distance: A metric measuring the difference between two reward functions by canonicalizing them (so the comparison is invariant to potential shaping and positive rescaling) and computing the Pearson distance between the canonicalized rewards over a coverage distribution

Reward Ensemble: Training multiple reward models on bootstrapped data and using their mean output as the reward signal to reduce uncertainty/variance