Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

📝 Paper Summary

Text-to-Image Generation Preference Optimization

LPO repurposes the pre-trained diffusion model itself as a noise-aware reward model to perform fast, step-level preference optimization directly in the latent space without expensive image decoding.

Core Problem

Existing methods use Vision-Language Models (VLMs) as reward models, which require converting noisy internal states into clean pixels at every step—a slow process that produces unreliable, blurry images at high noise levels.

Why it matters:

Pixel-level Reward Models (PRMs) must decode latent representations to pixels, slowing down training by 2.5-28x compared to the proposed method
VLMs trained on clean images fail to accurately judge the highly noisy, blurry images generated during the early steps of diffusion, leading to poor optimization at those stages
Current reward models lack sensitivity to timesteps, making it difficult to assess generation progress contextually

Concrete Example: At timestep t=900 (highly noisy), a standard reward model sees a meaningless blur after decoding, yet must predict if the final image will be good. It fails because the blur is out-of-distribution for the VLM. In contrast, LPO uses the diffusion model's own internal features, which are naturally trained to understand these noisy states.

Key Novelty

Latent Preference Optimization (LPO) with Latent Reward Model (LRM)

Repurposes the diffusion model (U-Net/DiT) as a reward model (LRM) by extracting visual features from noisy states and comparing them with text prompts, avoiding VAE decoding
Introduces Multi-Preference Consistent Filtering (MPCF) to clean training data, ensuring 'winning' images are superior in both aesthetics and text alignment so preference holds even under noise
Adds Visual Feature Enhancement (VFE), which injects text-alignment information into the reward model by contrasting conditional and unconditional feature representations

Architecture

Overview of Latent Reward Model (LRM) architecture and Latent Preference Optimization (LPO) pipeline.

Evaluation Highlights

Achieves 10-28x training speedup over Diffusion-DPO and 2.5-3.5x over SPO (Step-by-step Preference Optimization) by eliminating VAE decoding
Outperforms SDXL Base with a 67.48% win-rate on Pick-a-Pic v1 test set, surpassing SPO (64.21%) and Diffusion-DPO (65.25%)
Reduces GPU memory usage significantly: LPO requires ~32GB vs Diffusion-DPO's ~68GB for SDXL training

Breakthrough Assessment

8/10

Substantially reduces the computational bottleneck of reinforcement learning for diffusion models by moving the reward signal to latent space. The speedups are large (10-28x over Diffusion-DPO, 2.5-3.5x over SPO) while maintaining or improving quality.

⚙️ Technical Details

Problem Definition

Setting: Step-level preference optimization of diffusion models using a learned reward function

Inputs: Noisy latent image x_t at timestep t, text prompt p

Outputs: Preference score S(x_t, p)

Pipeline Flow

Sampler: Generates candidate latent pairs from current policy
Latent Reward Model (LRM): Scores candidates in latent space
Optimization: Updates diffusion weights via DPO loss

System Modules

Sampler

Generates multiple noisy latent candidates from the same starting point x_{t+1}

Model or implementation: SD1.5 / SDXL / SD3 (Policy Model)

Latent Reward Model (LRM)

Predicts preference scores for noisy latents without decoding

Model or implementation: Modified U-Net (SD1.5/SDXL) initialized from pre-trained weights
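As a rough illustration of how a latent-space score S(x_t, p) might be formed, the sketch below compares a pooled visual feature with a text embedding using temperature-scaled cosine similarity. The CLIP-style form, the toy vectors, and the name `latent_score` are assumptions made for illustration. The summary only states that visual features extracted from noisy latents are compared with the text prompt and that a temperature of 0.01 is used.

```python
import math

def latent_score(visual_feat, text_feat, tau=0.01):
    """Temperature-scaled cosine similarity between two feature vectors.

    Hypothetical stand-in for S(x_t, p): in a real LRM the visual feature
    would come from the diffusion backbone evaluated on the noisy latent.
    """
    dot = sum(v * t for v, t in zip(visual_feat, text_feat))
    norm_v = math.sqrt(sum(v * v for v in visual_feat))
    norm_t = math.sqrt(sum(t * t for t in text_feat))
    return dot / (norm_v * norm_t) / tau

# Toy features (made up): one aligned with the text, one orthogonal to it.
text = [1.0, 0.0]
aligned = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(latent_score(aligned, text), latent_score(orthogonal, text))
```

A small temperature stretches the score range, so small differences in cosine similarity become large differences in the preference logit.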

Loss Calculation

Selects winning/losing latents based on scores and computes DPO gradient

Model or implementation: DPO Loss Function
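The selection step can be sketched as follows, assuming the pair is formed from the highest- and lowest-scoring candidates and is discarded when their score gap falls below a threshold. The gap rule and the helper name `select_pair` are assumptions. The summary lists a dynamic threshold range but does not spell out how the threshold varies, so here the caller supplies it.

```python
def select_pair(scores, threshold):
    """Return (winner_idx, loser_idx), or None if the score gap is too small.

    `scores` holds one reward score per candidate latent sampled from the
    same starting point. The gap rule is an illustrative assumption.
    """
    if len(scores) < 2:
        return None
    win = max(range(len(scores)), key=lambda i: scores[i])
    lose = min(range(len(scores)), key=lambda i: scores[i])
    if scores[win] - scores[lose] < threshold:
        return None  # candidates too similar to give a reliable preference
    return win, lose

print(select_pair([0.1, 0.9, 0.4], threshold=0.5))
print(select_pair([0.50, 0.51], threshold=0.02))
```

Skipping near-tied candidates keeps ambiguous pairs out of the DPO update.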

Novel Architectural Elements

Latent Reward Model (LRM) leveraging U-Net intermediate features directly for scoring
Visual Feature Enhancement (VFE) module injecting alignment signals into visual features via uncond/cond difference
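The conditional/unconditional contrast in VFE resembles classifier-free guidance, so a minimal sketch might look like the following. The exact combination rule is an assumption borrowed from the usual CFG form, and `enhance_features` is a hypothetical helper name. Only the use of a cond/uncond difference and the scale value 2.0 come from the summary.

```python
def enhance_features(cond, uncond, gs=2.0):
    """CFG-style extrapolation of visual features (assumed form).

    cond   : features computed with the text prompt
    uncond : features computed with an empty prompt
    gs     : enhancement scale (the summary lists VFE_scale_gs = 2.0)
    """
    return [u + gs * (c - u) for c, u in zip(cond, uncond)]

# Toy features: only the first dimension depends on the prompt.
cond = [1.0, 2.0]
uncond = [0.5, 2.0]
print(enhance_features(cond, uncond))
```

Under this form, dimensions that do not change with the prompt are left untouched, while prompt-dependent dimensions are amplified, and gs = 1 reduces to the plain conditional features.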

Modeling

Base Model: SDXL (Stable Diffusion XL) and SD1.5

Training Method: Latent Preference Optimization (LPO) - a variant of DPO adapted for latent space

Objective Functions:

Purpose: Optimize diffusion model to prefer higher-scoring latent trajectories.

Formally: Min -E [log sigmoid( beta * (log(p_theta(w)/p_ref(w)) - log(p_theta(l)/p_ref(l))) )]

Purpose: Train LRM to predict human preferences.

Formally: Bradley-Terry model loss L_LRM = -E [log sigmoid( S(x_w) - S(x_l) )]
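Both objectives above reduce to a negative log-sigmoid of a margin, which the following per-pair sketch makes concrete. It works on scalar scores and log-probabilities rather than batched tensors, and the function names are illustrative rather than taken from the released code.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bradley_terry_loss(score_w, score_l):
    """LRM loss for one pair: -log sigmoid(S(x_w) - S(x_l))."""
    return -log_sigmoid(score_w - score_l)

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=2000.0):
    """DPO loss for one pair from policy and reference log-probabilities."""
    margin = (logp_w - logp_ref_w) - (logp_l - logp_ref_l)
    return -log_sigmoid(beta * margin)

print(bradley_terry_loss(2.0, 0.0))
print(dpo_loss(-1.000, -1.001, -1.000, -1.000))
```

When the policy equals the reference, the DPO margin is zero and the loss is log 2. With beta = 2000, even a 0.001 shift in log-ratio moves the sigmoid argument by 2, which is one reason such a large beta acts as a strong regularizer toward the reference model.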

Training Data:

LRM Training: Pick-a-Pic v1 (filtered via MPCF)
LPO Training: Prompt sets from Pick-a-Pic v1

Key Hyperparameters:

learning_rate: 1e-5 (LRM), 2e-6 (LPO-SD1.5), 1e-6 (LPO-SDXL)
batch_size: 64 (LRM), 16 (LPO-SD1.5), 8 (LPO-SDXL)
beta: 2000 (LPO Regularization)
threshold_range: [0.02, 1.0] (Dynamic threshold th_min, th_max)
VFE_scale_gs: 2.0
temperature_tau: 0.01

Compute: LPO Training Time (SDXL): 5.2 GPU hours on 8x A800 (vs 144.5h for Diffusion-DPO). LRM Training Time: 12.5 GPU hours.

Comparison to Prior Work

vs. SPO: LPO avoids VAE decoding (2.5-3.5x faster) and handles high-noise steps where SPO's pixel reward model fails
vs. Diffusion-DPO: LPO performs online sampling in latent space vs offline data usage, reducing distribution shift
vs. D3PO: LPO uses a latent reward model derived from the diffusion U-Net, whereas D3PO uses a standard pixel-based reward model

Limitations

Depends on the quality of the LRM, which itself is trained on limited preference data (Pick-a-Pic)
MPCF filtering reduces dataset size significantly (from ~580k to ~140k pairs), potentially reducing diversity
LPO training is sensitive to the dynamic threshold hyperparameters
Requires access to a pre-trained diffusion model architecture for the LRM initialization (homogeneous setup preferred)

Reproducibility

Code: https://github.com/Kwai-Kolors/LPO

Code and models are available at https://github.com/Kwai-Kolors/LPO. LRM training requires the Pick-a-Pic dataset filtered with MPCF (strategy described in Table 1). LPO uses the standard DPO loss formulation, with noisy latents rather than decoded images as inputs.

📊 Experiments & Results

Evaluation Setup

Text-to-Image Generation Preference Alignment

Benchmarks:

Pick-a-Pic v1 Test Set (Human Preference Prediction)
HPS v2 Test Set (Aesthetic/Alignment Evaluation)

Metrics:

Win-Rate (vs SDXL Base)
PickScore
HPS v2 Score
ImageReward
Aesthetic Score
CLIP Score

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative analysis against state-of-the-art preference optimization methods on SDXL showing superior win-rates and metric scores.

Benchmark	Metric	Baseline	This Paper	Δ
Pick-a-Pic v1 Test Set	Win-Rate vs SDXL Base (%)	64.21 (SPO)	67.48	+3.27
Pick-a-Pic v1 Test Set	PickScore	58.07	59.21	+1.14
HPS v2 Test Set	HPS v2 Score	31.06	32.61	+1.55
SDXL Training	Training Time (GPU Hours)	144.5 (Diffusion-DPO)	5.2	-139.3
SDXL Training	GPU Memory (GB)	68 (Diffusion-DPO)	32	-36
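The headline ratios can be rechecked directly from the training-cost rows of the table. The snippet below is only arithmetic on the reported numbers, with variable names chosen for illustration.

```python
# Reported SDXL training costs: Diffusion-DPO vs LPO.
dpo_hours, lpo_hours = 144.5, 5.2
dpo_mem_gb, lpo_mem_gb = 68, 32

speedup = dpo_hours / lpo_hours        # how many times faster LPO trains
mem_ratio = dpo_mem_gb / lpo_mem_gb    # how many times less memory LPO needs

print(f"training speedup: {speedup:.1f}x")
print(f"memory reduction: {mem_ratio:.1f}x")
```

The time ratio comes out at roughly 27.8, which matches the "up to 28x" figure quoted for Diffusion-DPO, and memory use drops by a bit more than half.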

Experiment Figures

Radar chart comparing LPO against baselines (SDXL, Diffusion-DPO, D3PO, SPO) across multiple metrics (PickScore, HPSv2, Aesthetic, etc.)

Main Takeaways

LPO provides a large training speedup (up to 28x over Diffusion-DPO) by operating purely in latent space, removing the VAE decoding bottleneck.
LPO aligns models better with human preference than pixel-based methods (SPO, D3PO), particularly due to better handling of high-noise steps.
The MPCF data filtering strategy is crucial; without it, the reward model struggles with inconsistent preference signals (e.g., winning image having worse aesthetics).
The method generalizes across architectures (U-Net based SD1.5/SDXL and DiT based SD3).

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (Forward/Backward process, U-Net/DiT architectures)
Latent Space vs. Pixel Space (VAE)
Direct Preference Optimization (DPO)
Classifier-Free Guidance (CFG)

Key Terms

LRM: Latent Reward Model—A reward model built from the diffusion backbone itself that predicts preferences directly from noisy latent images

PRM: Pixel-level Reward Model—Standard reward models (like CLIP) that require pixel inputs, necessitating VAE decoding during diffusion training

LPO: Latent Preference Optimization—The proposed training framework that uses LRM to optimize the diffusion model entirely in latent space

MPCF: Multi-Preference Consistent Filtering—A data cleaning strategy ensuring winning images in a pair outperform losers in multiple metrics (Aesthetics, CLIP score) to guarantee robust preference ordering under noise

VFE: Visual Feature Enhancement—A module in LRM that enhances feature focus on text-image alignment by computing the difference between conditional and unconditional intermediate features (similar to CFG)

SPO: Step-by-step Preference Optimization—A baseline method that optimizes step-wise preferences but operates in pixel space

VAE: Variational Autoencoder—The component in Latent Diffusion Models that compresses images into latent space and decodes them back to pixels