Feedback Efficient Online Fine-Tuning of Diffusion Models

📝 Paper Summary

Diffusion Model Fine-Tuning, Black-box Optimization, Reinforcement Learning

SEIKO fine-tunes diffusion models efficiently by interleaving reward modeling with diffusion updates and using KL regularization to explore only within the feasible data manifold.

Core Problem

Fine-tuning diffusion models to maximize a target property (such as bioactivity) usually requires costly ground-truth queries, and standard RL methods waste those queries on invalid samples outside the feasible manifold.

Why it matters:

Evaluating ground truth rewards in domains like biology and chemistry often requires expensive, time-consuming wet lab experiments, making feedback efficiency critical.
Existing methods assume static reward models or don't optimize for feedback efficiency, leading to wasteful exploration of invalid/unnatural samples (e.g., physically impossible molecules).

Concrete Example: In drug discovery, a diffusion model might generate many molecules. A standard RL method might steer the model to generate a molecule that looks high-reward to a proxy model but is chemically unstable (invalid), wasting a wet-lab test.

Key Novelty

SEIKO (Optimistic Finetuning of Diffusion models with KL constraint)

Interleaves reward learning and diffusion model updates: acquires samples, updates a proxy reward model with uncertainty estimates, then updates the diffusion model using this proxy.
Uses a KL divergence constraint relative to the pre-trained model to ensure exploration stays within the 'feasible space' (manifold of valid data) while maximizing an optimistic reward estimate.

Architecture

The iterative loop of SEIKO: Sampling -> Labelling -> Reward Model Update -> Diffusion Model Update.

Evaluation Highlights

Outperforms baselines (PPO, classifier guidance) on ImageNet 64x64 aesthetic quality, achieving higher rewards with fewer queries.
In biological sequence design (TF Bind 8), SEIKO finds high-activity sequences faster than PPO and specialized baselines like AdaProx.
Significant gains in small molecule generation (QED optimization), maintaining high validity and diversity while maximizing properties.

Breakthrough Assessment

8/10

Strong theoretical grounding (regret guarantee) combined with effective empirical results across diverse, high-value domains (images, biology, chemistry). Directly addresses the critical bottleneck of query cost.

⚙️ Technical Details

Problem Definition

Setting: Online bandit setting with a pre-trained diffusion model on a large design space

Inputs: Pre-trained diffusion model, query budget K, unknown true reward function r(x)

Outputs: Fine-tuned diffusion model policy that generates high-reward samples

Pipeline Flow

Data Collection: Sample from current diffusion model
Feedback: Query ground truth reward for samples
Reward Learning: Update reward model and uncertainty estimator on augmented dataset
Policy Update: Update diffusion model drift using optimistic reward estimates and KL constraint
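
The four steps above can be sketched on a toy problem. The example below is a minimal illustration, not the authors' implementation: it assumes a finite design space so that the "diffusion model" reduces to a probability vector, uses a per-candidate running mean with a count-based uncertainty as a stand-in reward model, and performs the policy update as a KL-regularized reweighting of the pre-trained distribution. The function name `run_toy_loop` and all of its parameters are hypothetical.

```python
import math
import random

def run_toy_loop(p_pre, true_reward, rounds=5, batch=4,
                 alpha=0.5, beta=1.0, noise=0.05, seed=0):
    """Toy interleaved loop on a finite design space (illustrative only)."""
    rng = random.Random(seed)
    n = len(p_pre)
    policy = list(p_pre)      # start from the pre-trained distribution
    sums = [0.0] * n          # running sum of observed rewards per candidate
    counts = [0] * n          # number of oracle queries per candidate
    queries = 0
    for _ in range(rounds):
        # 1. Data collection: sample candidates from the current policy.
        xs = rng.choices(range(n), weights=policy, k=batch)
        # 2. Feedback: query the noisy oracle for each sample.
        for x in xs:
            y = true_reward[x] + rng.gauss(0.0, noise)
            sums[x] += y
            counts[x] += 1
            queries += 1
        # 3. Reward learning: mean estimate plus a count-based uncertainty
        #    (an assumed stand-in for an ensemble or Gaussian process).
        r_hat = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(n)]
        sigma = [1.0 / math.sqrt(counts[i] + 1) for i in range(n)]
        # 4. Policy update: reweight p_pre by the optimistic reward.
        #    Candidates with p_pre == 0 keep zero probability.
        weights = [p_pre[i] * math.exp((r_hat[i] + beta * sigma[i]) / alpha)
                   for i in range(n)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return policy, queries

# Candidate 3 has a high true reward but lies outside the support of p_pre.
policy, queries = run_toy_loop([0.4, 0.3, 0.3, 0.0], [0.1, 0.5, 0.9, 5.0])
```

In this toy setting the invalid candidate is never sampled and never queried, which mirrors the feedback-efficiency argument: the query budget is spent only inside the support of the pre-trained model.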

System Modules

Diffusion Sampler

Generates new candidate samples x using the current drift f(t,x)

Model or implementation: Diffusion Model (SDE solver)
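
As a rough illustration of what an SDE solver does with a drift f(t, x), the sketch below integrates a one-dimensional SDE with the standard Euler-Maruyama scheme. It is a generic textbook discretization under assumed names (`euler_maruyama`, `noise_scale`), not the sampler used in the paper.

```python
import math
import random

def euler_maruyama(drift, x0, t_end=1.0, steps=1000, noise_scale=1.0, seed=0):
    """Integrate dx = drift(t, x) dt + noise_scale dW from t=0 to t_end."""
    rng = random.Random(seed)
    dt = t_end / steps
    x = x0
    for k in range(steps):
        t = k * dt
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        x = x + drift(t, x) * dt + noise_scale * dw
    return x

# With the noise switched off and drift -x, the result approximates exp(-1).
x_final = euler_maruyama(lambda t, x: -x, x0=1.0, noise_scale=0.0)
```

Fine-tuning changes the drift function passed to the solver; the integration scheme itself stays the same.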

Reward Oracle

Provides noisy ground-truth reward y for sample x

Model or implementation: Black-box function (e.g., aesthetic scorer, docking simulation)

Reward Model

Estimates mean reward and uncertainty to guide exploration

Model or implementation: Neural Network ensemble or Gaussian Process
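
One simple way to obtain both a mean and an uncertainty estimate is a bootstrap ensemble: fit several models on resampled copies of the collected data and read the spread of their predictions as uncertainty. The sketch below assumes one-dimensional inputs and linear ensemble members purely to stay short; the paper's reward models are neural networks or Gaussian processes, and the helper names here are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit y = a*x + b for 1-D inputs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0.0:  # degenerate resample: fall back to a constant model
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx

def bootstrap_ensemble(xs, ys, members=20, seed=0):
    """Fit one linear model per bootstrap resample of the (x, y) data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(members):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict(models, x):
    """Return (mean prediction, ensemble standard deviation) at x."""
    preds = [a * x + b for a, b in models]
    return statistics.fmean(preds), statistics.pstdev(preds)

models = bootstrap_ensemble([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
mean, std = predict(models, 1.5)
```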

Policy Optimizer

Updates the diffusion drift to maximize the optimistic reward minus a KL-divergence penalty relative to the pre-trained model

Model or implementation: Gradient-based optimizer (similar to policy gradient)
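
For intuition, a distribution-level version of this objective has a well-known closed-form maximizer. The block below is a simplification that treats the objective over the final sample distribution p rather than over the SDE drift; the symbols \hat r, \sigma, \alpha, and \beta follow the notation used in this summary, and Z is a normalizing constant introduced here.

```latex
\begin{aligned}
p^\star &= \arg\max_{p}\; \mathbb{E}_{x\sim p}\big[\hat r(x) + \beta\,\sigma(x)\big]
          - \alpha\,\mathrm{KL}\big(p \,\|\, p_{\mathrm{pre}}\big),\\
p^\star(x) &= \frac{1}{Z}\, p_{\mathrm{pre}}(x)\,
          \exp\!\Big(\frac{\hat r(x) + \beta\,\sigma(x)}{\alpha}\Big),
\qquad
Z = \int p_{\mathrm{pre}}(x)\,
          \exp\!\Big(\frac{\hat r(x) + \beta\,\sigma(x)}{\alpha}\Big)\,dx .
\end{aligned}
```

Because p* is a reweighting of p_pre, it places no probability where p_pre has none, which is the sense in which the KL term keeps exploration inside the feasible space. A smaller alpha sharpens the reweighting toward high optimistic reward; a larger alpha keeps p* closer to p_pre.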

Novel Architectural Elements

Interleaved loop where the reward model is explicitly retrained online to guide the diffusion update, unlike static guidance.
Integration of an explicit uncertainty bonus (UCB-style) directly into the diffusion training objective to drive exploration.

Modeling

Base Model: Varies by domain (Stable Diffusion for images, specialized diffusion models for bio/chem)

Training Method: Policy Gradient with KL regularization and Optimistic Exploration

Objective Functions:

Purpose: Maximize expected reward while staying close to the pre-trained distribution.

Formally: Maximize E_{x~p}[r(x)] - alpha * KL(p || p_pre)

Purpose: Train the proxy reward model.

Formally: Minimize a regression loss (MSE) on the collected (x, y) data.

Purpose: Guide exploration via uncertainty.

Formally: Use r_hat(x) + beta * sigma(x) as the reward signal for the diffusion update.
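
The effect of the bonus term can be seen in a small worked example. The sketch below assumes the reward model already provides a mean and a standard deviation per candidate; the candidate names and numbers are made up for illustration.

```python
def optimistic_reward(mean, std, beta):
    """UCB-style score: r_hat(x) + beta * sigma(x)."""
    return mean + beta * std

def rank_by_optimism(candidates, beta):
    """candidates maps name -> (mean, std); returns names, best score first."""
    return sorted(
        candidates,
        key=lambda name: optimistic_reward(*candidates[name], beta),
        reverse=True,
    )

candidates = {
    "well_explored": (0.80, 0.01),   # high mean, low uncertainty
    "under_explored": (0.70, 0.30),  # lower mean, high uncertainty
}
greedy = rank_by_optimism(candidates, beta=0.0)
optimistic = rank_by_optimism(candidates, beta=1.0)
```

With beta = 0 the ranking is purely greedy and prefers the well-explored candidate (0.80 vs 0.70). With beta = 1 the scores become 0.81 and 1.00, so the under-explored candidate is preferred, which is how the bonus steers queries toward regions the reward model knows little about.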

Key Hyperparameters:

KL_coefficient_alpha: Varies (e.g., 0.1)
exploration_bonus_beta: Varies
number_of_rounds_K: Varies (e.g., 5-20 depending on experiment)

Comparison to Prior Work

vs. DDPO/DPOK: SEIKO is explicitly designed for the online setting with limited queries, using uncertainty quantification.
vs. Guidance: SEIKO updates the model weights rather than just steering inference, and handles exploration via uncertainty.
vs. GFlowNet: SEIKO leverages powerful pre-trained diffusion models directly.

Limitations

Relies on the quality of the uncertainty quantification in the reward model.
Computational cost of retraining the reward model and diffusion model at every iteration.
Theoretical guarantees depend on assumptions about the reward function lying in a specific function class (RKHS).

Reproducibility

Code: https://github.com/zhaoyl18/SEIKO

Paper includes theoretical proofs in the appendix.

📊 Experiments & Results

Evaluation Setup

Online maximization of a property with limited oracle queries

Benchmarks:

ImageNet 64x64 (Generate images with high aesthetic score)
TF Bind 8 (Generate DNA sequences with high binding activity)
Small Molecules (QM9/ZINC) (Generate molecules with high QED (drug-likeness))

Metrics:

Average Reward (of top samples or batch)
Maximum Reward
Diversity
Validity (for molecules)
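
The summary does not state how these metrics are computed, so the sketch below shows one plausible set of definitions rather than the paper's: top-k average reward, mean pairwise Hamming distance as a diversity measure for fixed-length sequences, and validity as the fraction of samples passing a user-supplied check. All function names and the choice of Hamming distance are assumptions.

```python
from itertools import combinations

def top_k_mean(rewards, k):
    """Average reward of the k highest-scoring samples."""
    top = sorted(rewards, reverse=True)[:k]
    return sum(top) / len(top)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(ca != cb for ca, cb in zip(a, b))

def mean_pairwise_hamming(seqs):
    """Diversity as the mean Hamming distance over all unordered pairs."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

def validity(samples, is_valid):
    """Fraction of samples accepted by the is_valid predicate."""
    return sum(1 for s in samples if is_valid(s)) / len(samples)

diversity = mean_pairwise_hamming(["AAAA", "AAAT", "TTTT"])  # (1 + 4 + 3) / 3
```

For molecules, `is_valid` would typically wrap a chemistry toolkit's parsing check; that dependency is left out here to keep the example self-contained.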

Key Results

SEIKO consistently achieves higher rewards with fewer queries compared to baselines across different domains.

Benchmark	Metric	Baseline	This Paper	Δ
ImageNet 64x64 (Aesthetic)	Aesthetic Score	5.9	6.2	+0.3
TF Bind 8	Binding Score	0.6	0.95	+0.35

Main Takeaways

SEIKO is more feedback-efficient than standard RL baselines (PPO, DDPO) and guidance methods.
The uncertainty bonus is crucial for avoiding local optima and finding higher reward regions.
The KL constraint successfully keeps generated samples valid (e.g., chemically feasible molecules) while optimizing the target property.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Probabilistic Models (SDE formulation)
Reinforcement Learning (Policy Optimization)
Bandit Algorithms (Upper Confidence Bound)
KL Divergence

Key Terms

SEIKO: The proposed method, 'Optimistic Finetuning of Diffusion models with KL constraint'.

feasible space: The manifold of valid/meaningful data points (e.g., chemically valid molecules) defined by the support of the pre-trained model.

drift coefficient: The vector field guiding the diffusion process in the SDE formulation.

regret guarantee: A theoretical bound ensuring the algorithm performs nearly as well as an optimal strategy over time.

uncertainty model: A model that estimates the epistemic uncertainty of the reward prediction, used to encourage exploration of unknown regions.

UCB: Upper Confidence Bound—an algorithmic principle that chooses actions with high potential upside (mean + uncertainty) to balance exploration and exploitation.

PPO: Proximal Policy Optimization—a standard policy gradient RL algorithm.

SDE: Stochastic Differential Equation—a mathematical framework used to model the diffusion process.