Evaluation Setup
The paper evaluates reward extraction from pre-trained diffusion models across navigation, locomotion, and image generation.
Benchmarks:
- Maze2D (Navigation / Path Planning)
- Hopper / HalfCheetah / Walker2D (Locomotion, MuJoCo)
- Stable Diffusion vs Safe Stable Diffusion (Image Generation / Safety)
Metrics:
- Visual alignment of reward map (Maze2D)
- Performance of base model when steered by extracted reward (Locomotion)
- Qualitative assessment of reward on harmful vs harmless images (Image Generation)
- Statistical methodology: Not explicitly reported in the paper
Main Takeaways
- The method successfully extracts a 'relative reward' that captures the goal-directed behavior differences between an exploratory base model and an expert model in Maze2D.
- In high-dimensional locomotion tasks (Hopper, HalfCheetah, Walker2D), the extracted reward function can steer a suboptimal base policy to significantly higher performance, effectively recovering the expert's intent.
- The approach generalizes beyond sequential decision-making to image generation, where it identifies a 'safety' reward function by comparing standard Stable Diffusion with a safe version, assigning lower rewards to violent/hateful content.
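To make the 'relative reward' and 'steering' ideas above concrete, here is a minimal toy sketch (not the paper's actual training objective or models). It assumes the base and expert models admit tractable densities, treats the relative reward as the log-density ratio between expert and base, and steers a Langevin-style sampler with the reward's gradient; the Gaussian densities, goal location, and guidance weight are all illustrative assumptions.

```python
import numpy as np

# Illustrative densities (assumptions, not from the paper): the exploratory
# base model covers the space broadly; the expert concentrates near a goal.
GOAL = np.array([1.0, 1.0])

def log_gaussian(x, mean, std):
    """Log-density of an isotropic Gaussian (up to a dimension constant)."""
    d = x - mean
    return -0.5 * np.sum(d * d) / std**2 - x.size * np.log(std)

def log_p_base(x):
    return log_gaussian(x, mean=np.zeros(2), std=2.0)   # broad / exploratory

def log_p_expert(x):
    return log_gaussian(x, mean=GOAL, std=0.3)          # goal-directed

def relative_reward(x):
    """Relative reward as a log-density ratio: high where the expert
    model puts more mass than the base model."""
    return log_p_expert(x) - log_p_base(x)

def _fd_grad(f, x, eps=1e-4):
    """Central finite-difference gradient of a scalar function f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def steered_langevin_step(x, rng, step=0.05, guidance=1.0):
    """One Langevin-style step: follow the base model's score, plus a
    guidance term from the extracted reward's gradient."""
    drift = _fd_grad(log_p_base, x) + guidance * _fd_grad(relative_reward, x)
    return x + step * drift + np.sqrt(2 * step) * rng.normal(size=x.shape)

if __name__ == "__main__":
    x = np.array([-1.0, -1.0])          # start far from the goal
    rng = np.random.default_rng(0)
    for _ in range(200):
        x = steered_langevin_step(x, rng)
    print("final state:", x, "reward:", relative_reward(x))
```

With guidance weight 1 the combined drift reduces to the expert's score, so the steered sampler drifts toward the goal region even though only the base score and the extracted reward were used, mirroring the steering result reported for the locomotion tasks.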