Extracting Reward Functions from Diffusion Models

📝 Paper Summary

Inverse Reinforcement Learning (IRL) | Diffusion Models for Decision Making | AI Safety and Interpretability

The paper extracts reward functions by training a neural network to align its gradients with the difference in score outputs between an expert diffusion model and a suboptimal base diffusion model.

Core Problem

Extracting reward functions (Inverse Reinforcement Learning) typically requires environment access, simulators, or expensive iterative policy optimization loops, which are computationally heavy and difficult to apply to diffusion models.

Why it matters:

Learning rewards allows for better robustness and generalization compared to simple imitation learning.
Extracting rewards helps in auditing and interpreting AI behavior (e.g., identifying biases or harmful preferences in large generative models).
Current MaxEnt IRL (Maximum Entropy Inverse Reinforcement Learning) methods assume access to the environment to train the policy, which is not always feasible.

Concrete Example: When analyzing a large image generation model, it is difficult to explicitly identify its biases. By comparing a standard model to a 'safe' version, this method extracts a reward function that explicitly assigns low scores to harmful content (violence/hate) without needing labeled classifiers.

Key Novelty

Relative Reward Extraction via Score Difference

Defines a 'relative reward function' that explains the difference in probability distributions between two diffusion models (e.g., an expert and a base model).
Extracts this reward by training a network to match the difference between the 'scores' (gradients of the log-density) of the expert and base diffusion models.
Avoids the need for iterative policy updates or environment interaction, leveraging the steerability of diffusion models.
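The link between a relative reward and a score difference can be sketched in a few lines. This is a simplified illustration, not the paper's exact derivation: it assumes the expert density is an exponentially tilted version of the base density, and the symbols \rho, p_base, p_expert, s_base, and s_expert are notation introduced here.

```latex
% Assumption: the expert is the base model reweighted by exp(rho).
p_{\mathrm{expert}}(x) \;\propto\; p_{\mathrm{base}}(x)\,\exp\bigl(\rho(x)\bigr)

% Taking logs and differentiating removes the normalizing constant:
\nabla_x \log p_{\mathrm{expert}}(x) - \nabla_x \log p_{\mathrm{base}}(x) \;=\; \nabla_x \rho(x)

% With diffusion models, the same condition is imposed at each noise level t,
% using the learned scores in place of the exact ones:
\nabla_x \rho(x_t, t) \;\approx\; s_{\mathrm{expert}}(x_t, t) - s_{\mathrm{base}}(x_t, t)
```

Because only gradients appear, the reward is identified up to an additive constant, which is one reason it is described as relative.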

Architecture

Illustration of the method extracting a relative reward function from two decision-making diffusion models.

Evaluation Highlights

Successfully recovers ground-truth reward functions (distance maps) in Maze2D navigation environments by comparing exploratory and goal-directed diffusion models.
Steering a low-quality base diffusion model with the learned reward function results in significantly increased performance on standard locomotion benchmarks (Hopper, HalfCheetah, Walker2D).
Generalizes to image generation by extracting a reward function that penalizes harmful content when comparing Stable Diffusion to a 'safer' version.

Breakthrough Assessment

7/10

Proposes a mathematically grounded and computationally efficient method for IRL with diffusion models. It removes the need for environment interaction, which is a significant theoretical and practical advantage, though it relies on having two distinct pre-trained models.

⚙️ Technical Details

Problem Definition

Setting: Inverse Reinforcement Learning (IRL) in the context of diffusion-based generative policies.

Inputs: Two diffusion models: a base model (low-reward/exploratory) and an expert model (high-reward/optimal).

Outputs: A parameterized reward function (neural network) that explains the shift from base to expert behavior.

Pipeline Flow

Input: State/Action samples x
Diffusion Score Estimation (Base & Expert Models)
Reward Extraction (Gradient Matching)

System Modules

Base Diffusion Model (Diffusion Score Estimation)

Models the prior or low-quality behavior distribution

Model or implementation: Diffusion Model (U-Net or similar)

Expert Diffusion Model (Diffusion Score Estimation)

Models the optimal or high-quality behavior distribution

Model or implementation: Diffusion Model (U-Net or similar)

Reward Network

Learns the scalar reward value whose gradient matches the score difference

Model or implementation: Feed-forward Neural Network

Novel Architectural Elements

Training objective that aligns the gradient of a scalar network directly with the difference of two vector-valued diffusion score functions.

Modeling

Base Model: Diffusion Model (Architecture dependent on domain, e.g., U-Net for images/Maze2D)

Training Method: Gradient Matching (Physics-Informed Neural Network style optimization)

Objective Functions:

Purpose: Extract reward by matching gradients.

Formally: Minimize the squared Euclidean norm between the gradient of the reward network (taken with respect to its input) and the difference in score outputs: || grad(Reward) - (Score_Expert - Score_Base) ||^2
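A minimal sketch of this objective on a toy problem may help. It is not the paper's implementation: the two diffusion models are replaced by analytic Gaussian scores, the reward network is replaced by a hypothetical quadratic reward whose gradient is available in closed form, and all names (gaussian_score, reward_grad, fit_reward) are illustrative.

```python
import numpy as np

def gaussian_score(x, mean, std):
    """Score of an isotropic Gaussian: grad_x log N(x; mean, std^2 I)."""
    return -(x - mean) / std**2

def reward_grad(x, A, b):
    """Gradient of the toy reward rho(x) = 0.5 * sum(A * x**2) + b @ x."""
    return A * x + b

def gradient_matching_loss(x, A, b, target):
    """Mean squared Euclidean distance between grad(rho) and the score difference."""
    residual = reward_grad(x, A, b) - target
    return np.mean(np.sum(residual**2, axis=1))

def fit_reward(x, target, lr=0.1, steps=500):
    """Plain gradient descent on the gradient-matching loss (analytic gradients)."""
    dim = x.shape[1]
    A, b = np.zeros(dim), np.zeros(dim)
    for _ in range(steps):
        residual = reward_grad(x, A, b) - target      # shape (n, dim)
        A -= lr * 2.0 * np.mean(residual * x, axis=0)
        b -= lr * 2.0 * np.mean(residual, axis=0)
    return A, b

# Toy stand-ins for the two models (assumed values, chosen for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 2)) * 2.0
base_mean, base_std = np.zeros(2), 2.0
expert_mean, expert_std = np.array([1.0, -1.0]), 1.0

# Target is Score_Expert - Score_Base evaluated at the sampled points.
target = gaussian_score(x, expert_mean, expert_std) - gaussian_score(x, base_mean, base_std)
A, b = fit_reward(x, target)
print(A, b, gradient_matching_loss(x, A, b, target))
```

In this toy case the score difference is exactly linear in x, so the quadratic reward can match it perfectly. With a real reward network the gradient would come from automatic differentiation, and the loss would also be averaged over diffusion timesteps.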

Compute: Not reported in the provided text

Comparison to Prior Work

vs. MaxEnt IRL: Does not require environment access or an inner loop of policy optimization; works directly with pre-trained diffusion models.
vs. Diffuser: This method extracts the reward function that explains the difference between two Diffuser models, rather than just using a reward to plan.

Limitations

Relies on the availability of two distinct diffusion models (base and expert).
Quality of extracted reward depends on the quality of the score estimation in the diffusion models.
Requires the score functions to be well-defined and differentiable.

Reproducibility

Code: https://www.robots.ox.ac.uk/~vgg/research/reward-diffusion/

A video is available at the same URL. The paper uses standard environments (Maze2D, MuJoCo locomotion) and models (Stable Diffusion).

📊 Experiments & Results

Evaluation Setup

Reward extraction from pre-trained diffusion models in navigation, locomotion, and image generation.

Benchmarks:

Maze2D (Navigation / Path Planning)
Hopper / HalfCheetah / Walker2D (Locomotion, MuJoCo)
Stable Diffusion vs Safe Stable Diffusion (Image Generation / Safety)

Metrics:

Visual alignment of reward map (Maze2D)
Performance of base model when steered by extracted reward (Locomotion)
Qualitative assessment of reward on harmful vs harmless images
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Application of the method to image generation, comparing Stable Diffusion with a 'Safe' version.

Reward recovery in Maze2D environments.

Main Takeaways

The method successfully extracts a 'relative reward' that captures the goal-directed behavior differences between an exploratory base model and an expert model in Maze2D.
In high-dimensional locomotion tasks (Hopper, HalfCheetah, Walker2D), steering a suboptimal base policy with the extracted reward function yields significantly higher performance, which suggests that the reward captures the expert's intent.
The approach generalizes beyond sequential decision-making to image generation, where it identifies a 'safety' reward function by comparing standard Stable Diffusion with a safe version, assigning lower rewards to violent/hateful content.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (SDE formulation)
Score Matching
Reinforcement Learning as Probabilistic Inference

Key Terms

Diffusion Models: Generative models that create data by learning to reverse a gradual noise-addition process (denoising).

Score Function: The gradient of the log-probability density with respect to the data, which diffusion models are trained to approximate.

Inverse Reinforcement Learning (IRL): The problem of deriving a reward function that explains observed optimal behavior.

Classifier Guidance: A technique to steer the sampling process of a diffusion model using the gradients of a separate classifier or function (in this case, the extracted reward).
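The steering idea can be illustrated with a small sampling sketch. It is a deliberate simplification: it uses unadjusted Langevin dynamics on a one-dimensional toy density instead of the reverse diffusion process, and both base_score and reward_grad are hypothetical stand-ins for a learned base model and an extracted reward.

```python
import numpy as np

def base_score(x):
    """Score of the toy base density N(0, 1)."""
    return -x

def reward_grad(x, shift=2.0):
    """Gradient of the toy reward rho(x) = shift * x (stand-in for a learned reward)."""
    return np.full_like(x, shift)

def langevin_sample(score_fn, n=5000, steps=2000, step_size=0.01, seed=0):
    """Unadjusted Langevin dynamics: x <- x + eps * score(x) + sqrt(2 * eps) * noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    for _ in range(steps):
        noise = rng.normal(size=n)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * noise
    return x

# Sampling with the base score alone stays centered near 0.
unguided = langevin_sample(base_score)

# Adding the reward gradient to the score shifts samples toward higher reward.
guided = langevin_sample(lambda x: base_score(x) + reward_grad(x))

print(unguided.mean(), guided.mean())
```

Here the guided score equals -(x - 2), so the guided samples concentrate around 2 while the unguided ones stay around 0. This is the same mechanism, in miniature, by which an extracted reward can steer a base diffusion model.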

MaxEnt IRL: Maximum Entropy Inverse Reinforcement Learning—a framework that finds the reward function making observed behavior appear most probable under a maximum entropy distribution.

Relative Reward Function: A function quantifying the preference difference between two policies (diffusion models) rather than an absolute reward from the environment.