Reward-Directed Conditional Diffusion: Provable Distribution Estimation and Reward Improvement

📝 Paper Summary

Conditional Diffusion Models, Generative AI Control, Theoretical Deep Learning

RCGDM trains a conditional diffusion model on pseudo-labeled data to generate high-reward samples. The paper proves that this process recovers the data subspace and that the resulting reward gap behaves like off-policy bandit regret, which quantifies the trade-off between reward maximization and distribution shift.

Core Problem

Directing generative models to maximize rewards (e.g., safety, aesthetic scores) often forces them off the data manifold, creating a conflict between sample fidelity and reward optimization.

Why it matters:

Blindly maximizing rewards in applications like protein design or safe RL can produce invalid or dangerous samples that fail to respect physical or logical constraints
Existing heuristic guidance methods lack statistical guarantees regarding how well they approximate the target distribution or how much the reward actually improves
Balancing exploration (finding high rewards) and exploitation (staying near data) is theoretically opaque in diffusion models

Concrete Example: In text-to-image generation, increasing the guidance for a 'colorful' image too much (high target reward) might destroy the image structure, resulting in a chaotic, oversaturated blob rather than a high-quality colorful photo.

Key Novelty

Reward-Conditioned Generation via Diffusion Models (RCGDM) with Subspace Analysis

Treats reward-directed generation as a semi-supervised problem: learns a reward model on labeled data to pseudo-label massive unlabeled data for conditional training
Theoretically proves that the conditional score network implicitly learns the low-dimensional latent subspace of the data, enabling high-fidelity generation
Establishes a formal connection between generative diffusion and off-policy bandit learning, bounding the reward gap using bandit regret terms

Architecture

Overview of the RCGDM pipeline: Reward Learning -> Pseudo Labeling -> Conditional Diffusion Training -> Guided Generation.

Evaluation Highlights

Proves that the subspace recovery error scales as $\tilde{O}(1/\sqrt{n_1})$ in the number of unlabeled samples $n_1$, meaning the model effectively identifies the latent data subspace from unlabeled data
Demonstrates in text-to-image generation that increasing reward targets improves predicted rewards by ~6x (target 16 vs 1) but increases distribution shift error, validating the theoretical trade-off
Shows in simulation that on-support diffusion error remains linear for small reward shifts but becomes quadratic when the target reward exceeds the latent dimension ($a > d$)

Breakthrough Assessment

8/10

Significant theoretical contribution connecting diffusion models to bandit theory and subspace learning. Provides the first statistical guarantees for reward improvement in conditional diffusion, though empirical SOTA comparisons are secondary to the analysis.

⚙️ Technical Details

Problem Definition

Setting: Semi-supervised generation where data $x$ lies on a linear subspace $x=Az$ and labels $y$ are noisy reward measurements $y=f^*(x) + \xi$.

Inputs: Unlabeled dataset $D_{unlabel}$ and small labeled dataset $D_{label}$

Outputs: Generated population $\hat{P}(\cdot|\hat{y}=a)$ conditioned on target reward $a$
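
The setting above can be simulated in a few lines. The sketch below is a minimal illustration, assuming a Gaussian latent $z$, a random $A$ with orthonormal columns, and a linear reward $f^*(x) = \theta^\top x$. The function name, dimensions, sample size, and noise scale are hypothetical choices, not values from the paper.

```python
import numpy as np

def make_subspace_data(n, D=64, d=8, noise_std=0.1, seed=0):
    """Draw n points x = A z on a d-dimensional subspace of R^D with noisy rewards."""
    rng = np.random.default_rng(seed)
    # A has orthonormal columns (assumption), taken from a reduced QR factorization.
    A, _ = np.linalg.qr(rng.standard_normal((D, d)))
    # Hypothetical linear ground-truth reward f*(x) = theta^T x, with theta in span(A).
    theta = A @ rng.standard_normal(d)
    theta /= np.linalg.norm(theta)
    z = rng.standard_normal((n, d))                      # Gaussian latent (assumption)
    x = z @ A.T                                          # x = A z, one sample per row
    y = x @ theta + noise_std * rng.standard_normal(n)   # y = f*(x) + xi
    return x, y, A, theta

x, y, A, theta = make_subspace_data(n=1000)
# Every sample lies on the subspace: projecting onto span(A) leaves it unchanged.
residual = np.linalg.norm(x - x @ A @ A.T)
```

Here `residual` measures how far the generated data is from the column span of `A`; it should be at floating-point noise level because the data is constructed as $x = Az$.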

Pipeline Flow

Reward Learning (Regression on labeled data)
Pseudo-labeling (Augment unlabeled data with predicted rewards)
Conditional Score Matching (Train diffusion model on augmented data)
Guided Generation (Sample with target reward condition)
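
The first two stages of the pipeline can be sketched for the linear case as follows. This is an illustration under stated assumptions, not the authors' implementation: the reward model is closed-form ridge regression, pseudo-labels are perturbed with Gaussian noise of scale $\nu = 1/\sqrt{D}$ as listed under Key Hyperparameters, and the synthetic data, sample sizes, and function names (`fit_ridge`, `pseudo_label`) are hypothetical. Conditional score matching and guided generation would then consume the `augmented` array.

```python
import numpy as np

def fit_ridge(x_lab, y_lab, lam=1.0):
    """Closed-form ridge regression: theta = (X^T X + lam I)^{-1} X^T y."""
    D = x_lab.shape[1]
    return np.linalg.solve(x_lab.T @ x_lab + lam * np.eye(D), x_lab.T @ y_lab)

def pseudo_label(x_unlab, theta_hat, rng):
    """Predict rewards for unlabeled data and add N(0, nu^2) noise, nu = 1/sqrt(D)."""
    D = x_unlab.shape[1]
    nu = 1.0 / np.sqrt(D)
    return x_unlab @ theta_hat + nu * rng.standard_normal(x_unlab.shape[0])

# Hypothetical synthetic data: x = A z with a linear ground-truth reward.
rng = np.random.default_rng(0)
D, d = 32, 4
A, _ = np.linalg.qr(rng.standard_normal((D, d)))
theta_star = A @ np.ones(d) / np.sqrt(d)
x_lab = rng.standard_normal((200, d)) @ A.T            # small labeled set
y_lab = x_lab @ theta_star + 0.05 * rng.standard_normal(200)
x_unlab = rng.standard_normal((5000, d)) @ A.T         # large unlabeled set

theta_hat = fit_ridge(x_lab, y_lab, lam=1e-2)          # stage 1: reward learning
y_pseudo = pseudo_label(x_unlab, theta_hat, rng)       # stage 2: pseudo-labeling
augmented = np.column_stack([x_unlab, y_pseudo])       # input to conditional training
```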

System Modules

Reward Estimator (Preprocessing)

Approximates the ground truth reward function using limited labeled data

Model or implementation: Ridge Regression (Linear case) or Neural Network (General case)

Pseudo-Labeler (Preprocessing)

Annotates the massive unlabeled dataset to enable conditional training

Model or implementation: Inference using $\hat{f}$

Conditional Score Network

Learns the conditional score function $\nabla \log p_t(x|y)$

Model or implementation: Encoder-Decoder Score Network $s_{V,\psi}$ (Theory) or UNet (Experiments)

Sampler

Generates new samples conditioned on a target reward value

Model or implementation: Discretized Backward SDE
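
A discretized backward SDE can be sketched with a plain Euler-Maruyama loop. The example below assumes the forward process $dX_t = -\tfrac{1}{2}X_t\,dt + dW_t$, so that $X_t \mid X_0 \sim N(\alpha(t)X_0, h(t)I)$ with $\alpha(t) = e^{-t/2}$ and $h(t) = 1 - e^{-t}$; this choice, the step size, and the toy analytic score (data distributed as $N(a, \sigma_0^2)$ given reward $a$) are assumptions for illustration only. In practice `score_fn` would be the trained conditional score network.

```python
import numpy as np

def alpha(t):
    return np.exp(-t / 2.0)

def h(t):
    return 1.0 - np.exp(-t)

def sample_backward_sde(score_fn, a, n, dim, T=8.0, t0=0.01, step=0.01, seed=0):
    """Euler-Maruyama discretization of the reverse-time SDE, run from t = T down to t0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))            # start from the N(0, I) prior
    num_steps = int(round((T - t0) / step))
    for k in range(num_steps):
        t = T - k * step                         # current forward-time index
        drift = 0.5 * x + score_fn(x, a, t)      # reverse drift for the assumed forward SDE
        x = x + step * drift + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

SIGMA0 = 0.5  # hypothetical data spread around the target reward

def toy_score(x, a, t):
    """Exact conditional score when the data given y = a is N(a, SIGMA0^2)."""
    var_t = alpha(t) ** 2 * SIGMA0 ** 2 + h(t)
    return -(x - alpha(t) * a) / var_t

samples = sample_backward_sde(toy_score, a=2.0, n=20000, dim=1)
```

With an exact score, the samples should concentrate around the target value `a` with spread close to `SIGMA0`, up to discretization and early-stopping error.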

Novel Architectural Elements

Encoder-Decoder Score Network architecture (in theoretical analysis) designed to explicitly model and recover the latent low-dimensional subspace $V$
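
One plausible form of such an encoder-decoder score network is $s_{V,\psi}(x, y, t) = \frac{1}{h(t)}\left(V\,\psi(V^\top x, y, t) - x\right)$, where $V$ has orthonormal columns. The sketch below assumes this form and $h(t) = 1 - e^{-t}$; the placeholder `psi` is a hypothetical stand-in for the learned latent network, not the paper's parameterization.

```python
import numpy as np

def h(t):
    return 1.0 - np.exp(-t)

def encoder_decoder_score(x, y, t, V, psi):
    """Assumed form: s(x, y, t) = (V psi(V^T x, y, t) - x) / h(t)."""
    latent = x @ V                        # encoder: project rows of x onto span(V)
    decoded = psi(latent, y, t) @ V.T     # decoder: lift latent output back to R^D
    return (decoded - x) / h(t)

rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((16, 3)))   # orthonormal columns
x = rng.standard_normal((5, 16))
y = rng.standard_normal(5)
psi = lambda u, y, t: np.tanh(u) + y[:, None]       # hypothetical placeholder network
s = encoder_decoder_score(x, y, 0.5, V, psi)
```

Under this form, the component of the score orthogonal to $\mathrm{span}(V)$ is always $-x_\perp / h(t)$ regardless of $\psi$, so the reverse process pulls samples toward the learned subspace while $\psi$ only shapes the distribution inside it.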

Modeling

Base Model: Custom Encoder-Decoder Score Network (Theory) / Stable Diffusion v1.5 (Experiments)

Training Method: Conditional Denoising Score Matching

Objective Functions:

Purpose: Minimize difference between estimated score and true conditional score.

Formally: $\int_{t_0}^T \mathbb{E}[\| \nabla_{x'} \log \phi_t(x'|x) - s(x', y, t) \|^2] dt$
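
The objective can be estimated by Monte Carlo. The sketch below assumes the Gaussian transition kernel $\phi_t(x'|x) = N(\alpha(t)x, h(t)I)$ with $\alpha(t) = e^{-t/2}$ and $h(t) = 1 - e^{-t}$, for which $\nabla_{x'} \log \phi_t(x'|x) = -(x' - \alpha(t)x)/h(t)$. The function name `dsm_loss`, the uniform sampling of $t$, and the toy check are illustrative assumptions.

```python
import numpy as np

def alpha(t):
    return np.exp(-t / 2.0)

def h(t):
    return 1.0 - np.exp(-t)

def dsm_loss(score_fn, x, y, t0=0.01, T=5.0, num_times=64, seed=0):
    """Monte Carlo estimate of the conditional denoising score matching objective."""
    rng = np.random.default_rng(seed)
    ts = rng.uniform(t0, T, size=num_times)      # uniform times in [t0, T]
    total = 0.0
    for t in ts:
        noise = rng.standard_normal(x.shape)
        x_noisy = alpha(t) * x + np.sqrt(h(t)) * noise     # x' ~ phi_t(. | x)
        target = -(x_noisy - alpha(t) * x) / h(t)          # grad_{x'} log phi_t(x' | x)
        diff = target - score_fn(x_noisy, y, t)
        total += np.mean(np.sum(diff ** 2, axis=1))
    return (T - t0) * total / num_times          # rescale the average to the integral

# Toy check: if all data sits at one point x0, the marginal score equals the
# conditional score, so an oracle that knows x0 attains (numerically) zero loss.
x0 = np.array([1.0, -2.0, 0.5])
x = np.tile(x0, (256, 1))
y = np.zeros(256)
oracle = lambda xn, y, t: -(xn - alpha(t) * x0) / h(t)
zero = lambda xn, y, t: np.zeros_like(xn)
loss_oracle = dsm_loss(oracle, x, y)
loss_zero = dsm_loss(zero, x, y)
```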

Key Hyperparameters:

early_stopping_time: $t_0$ (theoretical parameter)
noise_level: $\nu = 1/\sqrt{D}$ (theoretical noise for pseudo-labels)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Classifier-Guided Diffusion: RCGDM builds the reward condition directly into training (via pseudo-labels) and analyzes the result, providing statistical guarantees on distribution recovery and reward improvement that heuristic guidance lacks
vs. Decision Diffuser: RCGDM focuses on the theoretical semi-supervised setting with provable subspace recovery, rather than purely empirical RL performance

Limitations

Theoretical results rely on a strong form of the manifold hypothesis (data lies exactly on a low-dimensional linear subspace)
Achievable reward improvement is limited by the 'off-support' error; pushing the target reward too high degrades sample quality
Assumes access to a reliable reward function estimator, which itself requires labeled data
Experiments are proofs-of-concept (simulation/visual) rather than large-scale competitive benchmarks

Reproducibility

No code provided. Theoretical assumptions (linear subspace, Gaussian latent) are clearly stated. Experimental details for Stable Diffusion (guidance levels, target values) are provided in text.

📊 Experiments & Results

Evaluation Setup

Simulation on synthetic data with linear subspace structure; Text-to-Image generation using Stable Diffusion directed by a classifier

Benchmarks:

Synthetic Linear Subspace (Data generation) [New]
Text-to-Image (Stable Diffusion) (Conditional Image Generation)

Metrics:

Subspace Angle $\angle(V, A)$
Off-support deviation $\|x_{\perp}\|_2$
Average Reward
Distribution Shift (Euclidean distance)
Statistical methodology: Standard deviation over 5 runs reported for simulation.
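
The two geometric metrics can be computed as sketched below. The exact definition of $\angle(V, A)$ is not given in this summary; the code assumes the squared Frobenius distance between projectors, $\|VV^\top - AA^\top\|_F^2$, with orthonormal $V$ and $A$. The off-support deviation is taken as the norm of the component of $x$ orthogonal to $\mathrm{span}(A)$. Function names are hypothetical.

```python
import numpy as np

def subspace_angle(V, A):
    """Assumed definition: ||V V^T - A A^T||_F^2 for matrices with orthonormal columns."""
    return np.linalg.norm(V @ V.T - A @ A.T, ord="fro") ** 2

def off_support_deviation(x, A):
    """Per-sample norm of the component of x orthogonal to span(A)."""
    x_perp = x - x @ A @ A.T
    return np.linalg.norm(x_perp, axis=1)

D, d = 6, 2
A = np.eye(D)[:, :d]                         # true subspace: first two coordinates
c, s = np.cos(0.3), np.sin(0.3)
V_same = A @ np.array([[c, -s], [s, c]])     # same subspace, rotated basis
V_orth = np.eye(D)[:, 2:4]                   # a subspace orthogonal to span(A)

angle_same = subspace_angle(V_same, A)
angle_orth = subspace_angle(V_orth, A)
dev = off_support_deviation(np.array([[1.0, 2.0, 0.0, 0.0, 0.0, 0.0],
                                      [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]]), A)
```

Because the metric compares projectors rather than bases, it is invariant to rotations of $V$ within the same subspace, which is the relevant notion when the score network can only identify the subspace up to a change of basis.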

Key Results

Simulation results verify theoretical scaling laws: reward improves linearly with target until distribution shift and off-support errors dominate.

Benchmark	Metric	Baseline	This Paper	Δ
Synthetic Linear Subspace	Average Reward	10.0	8.5	-1.5
Synthetic Linear Subspace	Off-support deviation	0.0	1.8	+1.8

Text-to-Image experiments with Stable Diffusion show the trade-off between maximizing the predicted reward and maintaining ground truth quality.

Benchmark	Metric	Baseline	This Paper	Δ
Stable Diffusion v1.5	Ground Truth Reward	0.5	3.5	+3.0
Stable Diffusion v1.5	Prediction Error (Pred - GT)	0.0	3.0	+3.0

Experiment Figures

Plot of Predicted vs Ground Truth rewards for generated images across different guidance levels and target values.

Main Takeaways

Theoretical analysis proves that conditional diffusion models can recover the underlying low-dimensional linear subspace of high-dimensional data.
The 'regret' (suboptimality) of generated samples decomposes into reward estimation error (bandit regret), on-support diffusion error, and off-support extrapolation error.
There is a phase transition in error scaling: when the target reward $a$ is less than the latent dimension $d$, the error is linear; when $a > d$, the error becomes quadratic due to lack of data coverage.
Empirical results confirm that aggressive reward targeting successfully increases predicted rewards but eventually decouples from ground truth rewards due to distribution shift.
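
The regret decomposition in the second takeaway can be written schematically as follows. The symbols $\mathrm{SubOpt}$, $\mathcal{E}_1$, $\mathcal{E}_2$, and $\mathcal{E}_3$ are notation introduced here for illustration, with $\hat{P}_a = \hat{P}(\cdot|\hat{y}=a)$; the exact expressions and constants for each term are in the paper and are not reproduced here.

$$
\mathrm{SubOpt}(\hat{P}_a) \;=\; a - \mathbb{E}_{x \sim \hat{P}_a}\left[f^*(x)\right] \;\le\; \underbrace{\mathcal{E}_1}_{\text{reward estimation (bandit regret)}} \;+\; \underbrace{\mathcal{E}_2}_{\text{on-support diffusion error}} \;+\; \underbrace{\mathcal{E}_3}_{\text{off-support extrapolation error}}
$$

Read this way, raising the target $a$ leaves more room for reward but, per the takeaways above, inflates the diffusion and extrapolation terms once $a$ moves beyond the region covered by the training data.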

📚 Prerequisite Knowledge

Prerequisites

Diffusion Probabilistic Models (Score matching)
Stochastic Differential Equations (SDEs)
Linear Bandits / Off-policy Regret
Subspace Learning / Manifold Hypothesis

Key Terms

RCGDM: Reward-Conditioned Generation via Diffusion Models, the proposed algorithm, which uses pseudo-labels and conditional score matching

Score Matching: A method to learn the gradient of the log-probability density (the score) of data, used to train diffusion models

Subspace Angle: A metric $\angle(V, A)$ measuring the alignment between the learned representation subspace $V$ and the true data subspace $A$

Off-policy Regret: The difference between the optimal/target reward and the actual reward obtained by a policy (or generator) trained on historical data

Pseudo-labeling: Using a model trained on a small labeled dataset to predict labels for a larger unlabeled dataset, which are then treated as ground truth for training

Backward SDE: The reverse-time stochastic differential equation used to sample data from noise in diffusion models