Hindsight PRIORs for Reward Learning from Human Preferences

📝 Paper Summary

Preference-based Reinforcement Learning (PbRL), Reward Learning, World Models

Hindsight PRIOR improves reward learning from human preferences by using an attention-based world model to identify important states in hindsight and redistributing predicted returns to those states.

Core Problem

Current PbRL methods lack a credit assignment strategy, making it difficult to determine which parts of a trajectory contributed to a preference label, leading to data inefficiency and misaligned reward functions.

Why it matters:

Without credit assignment, many possible reward functions can explain a preference, requiring large amounts of feedback to disambiguate.
Misaligned reward functions (that fit training data but fail to generalize) lead to sub-optimal policies when deployment environments differ slightly.
Reducing human feedback requirements is critical for scaling alignment techniques to complex tasks where supervision is expensive.

Concrete Example: In Montezuma's Revenge, an agent might receive a positive preference for a long trajectory because of a single key jump. Standard PbRL might smear the reward across all steps (including walking), whereas the true reward should be concentrated on the setup and execution of the jump.

Key Novelty

PRIor On Reward (PRIOR)

Uses an attention-based world model to predict future states; the model's attention weights serve as a proxy for 'state importance' (states that are most predictive of the future).
Redistributes the total predicted return of a trajectory to individual state-action pairs proportional to this importance, creating a dense supervision signal for the reward model.
Combines the standard preference cross-entropy loss with this auxiliary 'hindsight' regression loss to guide reward learning.
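The redistribution step described above can be illustrated with a minimal sketch. It assumes per-step importance scores are already available as a non-negative vector; the function name `redistribute_return` and the example numbers are hypothetical and not taken from the paper's code.

```python
import numpy as np

def redistribute_return(pred_rewards, importance):
    """Spread a segment's predicted return over its steps in proportion to importance."""
    pred_rewards = np.asarray(pred_rewards, dtype=float)
    importance = np.asarray(importance, dtype=float)
    # Normalize importance so the weights sum to 1 over the segment.
    weights = importance / importance.sum()
    # Total predicted return of the segment under the current reward model.
    predicted_return = pred_rewards.sum()
    # Per-step target rewards; by construction they sum back to the predicted return.
    return weights * predicted_return

# Illustrative values only: a flat reward prediction and importance peaked on step 3.
pred_rewards = [0.2, 0.2, 0.2, 0.2, 0.2]
importance = [0.05, 0.05, 0.10, 0.70, 0.10]
targets = redistribute_return(pred_rewards, importance)
```

The targets keep the segment's total predicted return unchanged but concentrate it on the steps with high importance, which is what makes them usable as a dense regression signal.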

Architecture

The Hindsight PRIOR framework. It illustrates the flow of a trajectory through a Policy, then to a World Model (TWM) to extract Attention Maps, which are then used to redistribute the Predicted Return (from the Reward Function) to create 'Target Rewards' for updating the Reward Function.

Evaluation Highlights

Recovers significantly more reward on average compared to baselines: +20% on MetaWorld and +15% on DMC (p < 0.05).
Achieves ≥ 80% success rate on MetaWorld tasks with as little as half the amount of feedback required by baselines.
Demonstrates robustness to incorrect/noisy preference feedback, maintaining higher performance than PEBBLE and other baselines when feedback error rates increase.

Breakthrough Assessment

7/10

Offers a logically sound and empirically effective solution to the credit assignment problem in PbRL using world models. The gains in sample efficiency and robustness are significant.

⚙️ Technical Details

Problem Definition

Setting: Preference-based Reinforcement Learning (PbRL) where a reward function is learned from trajectory pairs with preference labels.

Inputs: Dataset of preference triplets (tau_0, tau_1, y_p) where y_p indicates the preferred trajectory segment.
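As a rough illustration of the input format, a preference triplet could be represented as below. The class names, field shapes, and the integer encoding of y_p are assumptions made for this sketch; implementations may instead store y_p as a distribution over the two segments (for example to express ties).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Segment:
    states: np.ndarray   # shape (T, state_dim)
    actions: np.ndarray  # shape (T, action_dim)

@dataclass
class PreferenceTriplet:
    tau_0: Segment
    tau_1: Segment
    y_p: int  # assumed encoding: 0 if tau_0 is preferred, 1 if tau_1 is preferred

def make_random_segment(T, state_dim, action_dim, rng):
    """Hypothetical helper that fabricates a segment of random data for the example."""
    return Segment(
        states=rng.normal(size=(T, state_dim)),
        actions=rng.normal(size=(T, action_dim)),
    )

rng = np.random.default_rng(0)
dataset = [
    PreferenceTriplet(
        tau_0=make_random_segment(50, 4, 2, rng),
        tau_1=make_random_segment(50, 4, 2, rng),
        y_p=1,
    )
]
```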

Outputs: Learned reward function r_psi(s, a) and optimal policy pi_phi.

Pipeline Flow

Group: Policy & Exploration -> Policy Agent interacts with Environment -> Buffer
Group: World Model -> Transformer World Model trains on Buffer -> Attention Maps
Group: Reward Learning -> Sample Pairs -> Compute Predicted Returns -> Redistribute via Attention -> Update Reward Function
Group: Policy Update -> Update Policy using Learned Reward

System Modules

Policy Agent

Interacts with the environment to generate trajectories and updates behavior based on learned rewards.

Model or implementation: SAC (Soft Actor-Critic) architecture

Transformer World Model (TWM)

Learns forward dynamics to predict next latent states; provides attention weights as proxy for state importance.

Model or implementation: Transformer-XL-based autoregressive dynamics model
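One way attention weights could be collapsed into a per-step importance vector is sketched below. The aggregation rule (average over heads, then sum the attention each step receives from all query positions) is an assumption for illustration; this summary does not specify the rule the paper uses, and the attention maps here are random stand-ins rather than outputs of a trained world model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def importance_from_attention(attn):
    """attn has shape (heads, T, T); each row is a distribution over key positions.

    Returns a length-T vector of non-negative importance weights that sums to 1.
    """
    attn = np.asarray(attn, dtype=float)
    # Average over heads, then sum over query positions: attention received per step.
    received = attn.mean(axis=0).sum(axis=0)
    return received / received.sum()

# Random causal attention maps standing in for a world model's attention.
rng = np.random.default_rng(0)
heads, T = 4, 6
scores = rng.normal(size=(heads, T, T))
causal_mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(causal_mask, scores, -np.inf)  # a query cannot attend to later steps
attn = softmax(scores, axis=-1)
alpha = importance_from_attention(attn)
```

Note that under a causal mask early steps can receive attention from more query positions than late steps, so a real implementation may need a different normalization.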

Reward Model

Predicts scalar rewards for state-action pairs; trained via preference cross-entropy and auxiliary hindsight loss.

Model or implementation: MLP (Multi-Layer Perceptron) or Transformer. The method is independent of the reward model architecture, and the paper implies standard implementations.

Novel Architectural Elements

Integration of a Transformer-based World Model solely for extracting attention weights (state importance) to guide the Reward Model's credit assignment.
Hindsight PRIOR auxiliary loss which forces the learned reward distribution within a trajectory to match the world model's attention distribution.

Modeling

Base Model: TWM (Transformer-based World Model) for importance; SAC for policy.

Training Method: Iterative PbRL: Alternating between Reward Learning (supervised) and Policy Learning (RL).

Objective Functions:

Purpose: Ensure preferred trajectories have higher predicted returns.

Formally: L_CE = - sum_{(tau_0, tau_1, y) in D} [ y(0) * log P(tau_0 > tau_1) + y(1) * log P(tau_1 > tau_0) ]

Purpose: Redistribute predicted return to states based on importance (attention).

Formally: L_prior = || r_hat - alpha * G_hat_psi ||^2, where r_hat is the vector of per-step predicted rewards, alpha is the vector of attention weights, and G_hat_psi is the predicted return.

Purpose: Combined loss for reward function.

Formally: L_total = L_CE + lambda * L_prior
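The three objectives above can be sketched numerically for a single preference pair. This is a minimal NumPy illustration, not the paper's implementation. It makes three assumptions: P(tau_0 > tau_1) follows the Bradley-Terry form (a softmax over the two predicted returns), the prior term is applied to both segments, and lambda takes a placeholder value of 0.5, since the paper's value is not reported here. Whether the redistributed target is held fixed during gradient computation is also not stated in this summary.

```python
import numpy as np

def bradley_terry_ce(r0, r1, y):
    """Preference cross-entropy for one pair.

    r0, r1: per-step predicted rewards of the two segments.
    y: length-2 label distribution, e.g. [0, 1] if tau_1 is preferred.
    """
    g0, g1 = float(np.sum(r0)), float(np.sum(r1))
    # Log-softmax over the two predicted returns, computed stably.
    m = max(g0, g1)
    log_z = m + np.log(np.exp(g0 - m) + np.exp(g1 - m))
    log_p0, log_p1 = g0 - log_z, g1 - log_z
    return -(y[0] * log_p0 + y[1] * log_p1)

def prior_loss(r_hat, alpha):
    """Squared error between predicted rewards and the attention-weighted predicted return."""
    r_hat = np.asarray(r_hat, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    target = alpha * r_hat.sum()  # alpha * G_hat
    return float(np.sum((r_hat - target) ** 2))

def total_loss(r0, r1, y, alpha0, alpha1, lam):
    """L_CE + lambda * L_prior, with the prior term summed over both segments (assumption)."""
    return bradley_terry_ce(r0, r1, y) + lam * (prior_loss(r0, alpha0) + prior_loss(r1, alpha1))

r0 = np.array([0.1, 0.1, 0.1])       # predicted return 0.3
r1 = np.array([0.5, 0.5, 0.5])       # predicted return 1.5
y = [0.0, 1.0]                       # tau_1 preferred
uniform = np.array([1/3, 1/3, 1/3])  # importance that matches the flat rewards
peaked = np.array([0.0, 1.0, 0.0])   # importance concentrated on the middle step

ce = bradley_terry_ce(r0, r1, y)
loss_uniform = total_loss(r0, r1, y, uniform, uniform, lam=0.5)
loss_peaked = total_loss(r0, r1, y, peaked, peaked, lam=0.5)
```

With uniform importance the flat reward predictions already match their targets, so the prior term vanishes and the total equals the cross-entropy. With peaked importance the same predictions are penalized for spreading reward evenly.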

Training Data:

Trajectories collected during policy training.
Preference labels provided by a scripted teacher (simulated human) based on ground truth returns.
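A scripted teacher of this kind can be sketched as follows. The function name, the tie-breaking rule, and the uniform label-flipping noise model (controlled by a hypothetical `error_rate` parameter) are assumptions for illustration; the paper's teacher may differ in these details.

```python
import numpy as np

def scripted_teacher(true_rewards_0, true_rewards_1, error_rate=0.0, rng=None):
    """Return 0 if segment 0 has the higher ground-truth return, else 1.

    With probability error_rate the label is flipped to simulate a mistaken teacher.
    Ties are resolved in favor of segment 1 (arbitrary choice for this sketch).
    """
    rng = rng if rng is not None else np.random.default_rng()
    g0, g1 = np.sum(true_rewards_0), np.sum(true_rewards_1)
    label = 0 if g0 > g1 else 1
    # rng.random() draws from [0, 1), so error_rate=0 never flips and error_rate=1 always flips.
    if rng.random() < error_rate:
        label = 1 - label
    return label

rng = np.random.default_rng(0)
clean_label = scripted_teacher([1.0, 1.0], [0.0, 0.0], rng=rng)
flipped_label = scripted_teacher([1.0, 1.0], [0.0, 0.0], error_rate=1.0, rng=rng)
```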

Key Hyperparameters:

feedback_frequency_K: Not explicitly reported in the paper
batch_size_M: Not explicitly reported in the paper
lambda (loss weight): Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. PEBBLE: PRIOR adds the auxiliary hindsight loss based on world model attention; PEBBLE relies solely on Cross-Entropy.
vs. Kim et al. (2023): PRIOR decouples the importance mechanism (World Model) from the Reward Model architecture, whereas Kim et al. require a specific Transformer reward model.
vs. SURF/MR: PRIOR focuses on credit assignment via dynamics consistency rather than data augmentation.

Limitations

Relies on the assumption that states predictive of future dynamics (attention) are semantically 'important' for human preference, which may not always hold.
Adds computational overhead of training a world model alongside the policy and reward model.
Hyperparameters for the auxiliary loss weight are not extensively analyzed in the main text.

Reproducibility

Code: https://github.com/apple/ml-rlhf-hindsight-prior

The paper names specific baselines (PEBBLE, SURF, MR) and benchmarks (MetaWorld, DMC). The loss weight lambda is not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Simulated robotic control tasks using synthetic preferences (oracle teacher).

Benchmarks:

MetaWorld: robotic manipulation (e.g., Hammer, Sweep)
DeepMind Control Suite (DMC): locomotion (e.g., Walker, Quadruped)

Metrics:

Success Rate (MetaWorld)
Episode Return (DMC)
Reward Recovery (correlation/error w.r.t. ground truth)
Statistical methodology: p-values reported (< 0.05) for significance claims.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MetaWorld	Recovered Reward	Not reported in the paper	Not reported in the paper	Not reported in the paper
MetaWorld	Success Rate (%)	Not reported in the paper	≥ 80	Not reported in the paper

Success rate comparisons on MetaWorld tasks show Hindsight PRIOR achieving higher performance with less feedback.

Main Takeaways

Integrating state importance from a world model significantly improves reward recovery (alignment with ground truth) compared to standard PbRL methods.
The method is highly sample efficient, achieving high success rates with reduced human feedback budgets (e.g., half the feedback on MetaWorld).
Hindsight PRIOR is robust to noisy teachers, degrading less than baselines when preference labels are incorrect.
Ablations confirm that the 'Hindsight' component (using future info via the world model) is crucial; forward-only importance is less effective.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, returns, policies)
Preference-based RL (Bradley-Terry model, cross-entropy loss)
Transformer architecture (Attention mechanisms)
World Models (Forward dynamics prediction)

Key Terms

PbRL: Preference-based Reinforcement Learning. Learning a policy from preference feedback rather than a pre-defined reward signal.

Credit Assignment Problem: The challenge of determining which specific actions or states in a sequence are responsible for the final outcome (reward or preference).

TWM: Transformer-based World Model. An architecture that models environment dynamics using attention mechanisms.

Hindsight: Analyzing events after they have occurred; here, determining which states were important for a trajectory after observing the full sequence.

Bradley-Terry model: A statistical model used to predict the probability that one item is preferred over another based on their underlying values (returns).

Return Redistribution: A technique to re-allocate the total return of a trajectory to its constituent steps based on some criteria (here, attention weights).

PEBBLE: A baseline PbRL algorithm that uses unsupervised pre-training and off-policy learning.

MetaWorld: A benchmark suite of robotic manipulation tasks.

DMC: DeepMind Control Suite. A set of physics-based simulation tasks for RL.