TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

📝 Paper Summary

Reinforcement Learning for Generative Models, Few-Step Diffusion Models, Text-to-Image Generation

TDM-R1 enables few-step diffusion models to learn from non-differentiable rewards (like human preference or OCR) by using deterministic trajectories to accurately estimate intermediate rewards and training a dynamic surrogate reward model.

Core Problem

Existing RL methods for few-step diffusion models require differentiable rewards for backpropagation, excluding critical real-world signals like human binary preference, discrete object counts, or OCR correctness.

Why it matters:

Reliance on differentiable rewards prevents optimization against true user intent (e.g., 'Does this image look good?') or hard constraints (e.g., 'Does the text spell exactly X?').
Applying standard diffusion RL to few-step models fails because denoising-based RL objectives produce blurry results when steps are few.
Assigning rewards to intermediate noisy steps is difficult, often leading to high variance or biased estimates if only the final image reward is used.

Concrete Example: When asked to render specific text in an image, a standard few-step model may produce garbled letters. An OCR-based reward could correct this, but existing few-step RL methods cannot use it because OCR outputs are discrete and provide no gradient. TDM-R1 can optimize against this reward and produces correctly spelled text.

Key Novelty

Trajectory Distribution Matching with Surrogate Reward Learning (TDM-R1)

Leverages the deterministic nature of Trajectory Distribution Matching (TDM) to obtain unbiased reward estimates for intermediate noisy samples, treating the final reward as a probability.
Decouples learning into two parts: a Generator that maximizes a surrogate reward, and a Surrogate Reward model trained via group-based preference optimization to approximate the non-differentiable signal.
Uses a dynamic reference model (EMA of the reward model) to provide stable regularization without overfitting to noisy signals or becoming too rigid.

Architecture

The iterative training loop of TDM-R1, showing the interaction between the Few-Step Generator, the Non-Differentiable Reward oracle, and the Surrogate Reward learner.

Evaluation Highlights

Boosts GenEval benchmark performance from 61% to 92% using SD3.5-M, significantly surpassing the 80-NFE base model (63%) and GPT-4o (84%).
Outperforms the 80-NFE base model while using only 4 NFEs (Number of Function Evaluations), i.e. 20x fewer network calls per image.
Scales to the 6B-parameter Z-Image model, outperforming both its 100-NFE and few-step variants across in-domain and out-of-domain metrics.

Breakthrough Assessment

9/10

Significantly advances few-step generation by solving the non-differentiable reward problem. The performance jump on GenEval (beating GPT-4o with a few-step model) is remarkable and practically impactful.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement learning fine-tuning of a pre-trained few-step diffusion model using non-differentiable scalar rewards

Inputs: Text condition c and a pre-trained few-step model p_theta

Outputs: Optimized few-step generator p_theta producing high-reward images x_0

Pipeline Flow

Few-Step Generator (TDM) -> generates noisy trajectory samples
Non-Differentiable Reward -> scores final clean images
Surrogate Reward Learner -> learns step-wise rewards from trajectory samples and final scores
Generator Optimizer -> updates Generator using Surrogate Reward gradients
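
The four stages above can be sketched as a toy loop. This is only an illustration under strong assumptions, not the paper's implementation: "images" are scalars, the generator is a single parameter `theta`, the surrogate is a linear score `w * x`, the oracle is a hypothetical 0/1 threshold check, and the KL term is replaced by its closed form for unit-variance Gaussians. All names and hyperparameter values are made up for the example.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def rollout(theta, z, num_steps=4):
    """Deterministic toy trajectory: the whole path is a function of (theta, z)."""
    # Index 0 is the clean sample; higher indices are noisier intermediate states.
    return [theta + z * (1.0 + 0.5 * k) for k in range(num_steps)]

def oracle_reward(x0, threshold=2.0):
    """Non-differentiable 0/1 reward, standing in for e.g. an OCR check."""
    return 1.0 if x0 > threshold else 0.0

def train(iterations=300, group_size=16, lr=0.05, beta=0.5, seed=0):
    rng = random.Random(seed)
    theta, theta_ref = 0.0, 0.0  # generator parameter and frozen base model
    w = 0.0                      # surrogate reward r(x) = w * x
    for _ in range(iterations):
        # 1) Generator: sample a group of deterministic trajectories.
        group = [rollout(theta, rng.gauss(0.0, 1.0)) for _ in range(group_size)]
        # 2) Oracle: score only the final clean samples.
        scores = [oracle_reward(traj[0]) for traj in group]
        winners = [t for t, s in zip(group, scores) if s == 1.0]
        losers = [t for t, s in zip(group, scores) if s == 0.0]
        # 3) Surrogate: Bradley-Terry gradient over every step of each
        #    winner/loser pair (the final score is shared by the whole path).
        grad, count = 0.0, 0
        for win in winners:
            for lose in losers:
                for x_w, x_l in zip(win, lose):
                    diff = x_w - x_l
                    grad += -(1.0 - sigmoid(w * diff)) * diff
                    count += 1
        if count:
            w -= lr * grad / count
        # 4) Generator: ascend the surrogate (d r / d theta = w), with the
        #    gradient of beta * KL(N(theta, 1) || N(theta_ref, 1)) pulling back.
        theta += lr * (w - beta * (theta - theta_ref))
    return theta, w

theta, w = train()
print(f"theta={theta:.3f}, surrogate weight w={w:.3f}")
```

In this toy the oracle is never differentiated: the generator only sees gradients of the learned surrogate, which is the decoupling the pipeline describes.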

System Modules

Few-Step Generator

Generates images via deterministic K-step trajectory

Model or implementation: SD3.5-M or Z-Image (TDM-adapted)

Surrogate Reward Model

Provides differentiable learning signal to the generator

Model or implementation: Diffusion-parameterized reward network

Dynamic Reference Model

Provides regularization to prevent reward over-optimization

Model or implementation: EMA copy of the Surrogate Reward parameters
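
A minimal sketch of how an EMA copy tracks a set of online parameters. The dictionary-of-floats representation and the `decay` value are assumptions for illustration; the summary does not report the decay rate used in the paper.

```python
def ema_update(reference, online, decay=0.99):
    """Move each reference weight a small step toward the online weight."""
    return {
        name: decay * reference[name] + (1.0 - decay) * online[name]
        for name in reference
    }

# Hypothetical surrogate-reward parameters and their slow-moving reference copy.
online = {"w": 1.0, "b": -2.0}
reference = {"w": 0.0, "b": 0.0}

# decay=0.5 is chosen only so the numbers are easy to follow by hand.
for _ in range(3):
    reference = ema_update(reference, online, decay=0.5)

print(reference)
```

A decay close to 1 makes the reference nearly frozen, while a small decay makes it follow the online model almost immediately; the "dynamic" reference sits between those extremes.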

Novel Architectural Elements

Decoupled Generator and Surrogate Reward training loop where the surrogate is trained on-the-fly using deterministic trajectory samples
Use of deterministic TDM trajectories to assign mathematically justified intermediate rewards based on final image probability
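
The benefit of deterministic trajectories for intermediate reward assignment can be illustrated with a toy comparison. The sketch below assumes scalar samples, a hypothetical 0/1 oracle, and made-up continuation functions; it only shows why a single rollout pins down the reward of an intermediate state when the continuation is deterministic, and why it does not when noise is injected.

```python
import random
import statistics

def oracle_reward(x0):
    """Hypothetical non-differentiable reward on the final sample."""
    return 1.0 if x0 > 0.0 else 0.0

def finish_deterministic(x_t):
    # Same intermediate state always leads to the same final sample.
    return 0.5 * x_t

def finish_stochastic(x_t, rng):
    # Fresh noise is injected, so the final sample varies between rollouts.
    return 0.5 * x_t + rng.gauss(0.0, 1.0)

rng = random.Random(0)
x_t = 0.4  # one fixed intermediate (noisy) state

det_rewards = [oracle_reward(finish_deterministic(x_t)) for _ in range(1000)]
sto_rewards = [oracle_reward(finish_stochastic(x_t, rng)) for _ in range(1000)]

det_var = statistics.pvariance(det_rewards)
sto_var = statistics.pvariance(sto_rewards)
print(f"deterministic variance: {det_var}, stochastic variance: {sto_var:.3f}")
```

With the deterministic continuation, the final reward can be copied back to `x_t` without averaging over many rollouts.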

Modeling

Base Model: SD3.5-M (Stable Diffusion 3.5 Medium) and Z-Image (6B parameters)

Training Method: Online Reinforcement Learning with Surrogate Reward

Objective Functions:

Purpose: Train the surrogate reward to rank groups of noisy samples correctly.
Formally: Group-based Bradley-Terry loss L_reward(phi), minimizing the negative log-likelihood of the preferred groups.

Purpose: Optimize the generator to maximize the surrogate reward while staying close to the base distribution.
Formally: L_gen(theta) = E[-r_phi(x_tk, c) + beta * KL(student || teacher)].

Purpose: Maintain generation quality by matching score distributions.
Formally: Denoising score matching loss for the online fake score estimator.
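
As a rough illustration of the first objective, here is a generic Bradley-Terry loss over a group of scored samples. The pairing rule (every pair where the oracle strictly prefers one sample) and the function name are assumptions; the paper's exact group-based formulation may differ.

```python
import math

def bradley_terry_group_loss(surrogate_scores, oracle_scores):
    """Mean -log sigmoid(r_i - r_j) over pairs where the oracle prefers i to j."""
    losses = []
    for r_i, y_i in zip(surrogate_scores, oracle_scores):
        for r_j, y_j in zip(surrogate_scores, oracle_scores):
            if y_i > y_j:
                d = r_i - r_j
                # log(1 + exp(-d)) equals -log(sigmoid(d)); this form avoids overflow.
                losses.append(max(-d, 0.0) + math.log1p(math.exp(-abs(d))))
    # A group with no strict preference carries no ranking signal.
    return sum(losses) / len(losses) if losses else 0.0

# Surrogate agrees with the oracle: low loss.
agree = bradley_terry_group_loss([2.0, -2.0], [1.0, 0.0])
# Surrogate ranks the pair the wrong way round: high loss.
disagree = bradley_terry_group_loss([-2.0, 2.0], [1.0, 0.0])
print(f"agree={agree:.4f}, disagree={disagree:.4f}")
```

Only the ordering of the oracle scores is used, so the oracle can be any non-differentiable signal as long as it ranks samples within a group.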

Key Hyperparameters:

NFE: 4 (Number of Function Evaluations for inference)
beta: Controls regularization strength (implicitly defined in Eq 10)
clip_epsilon: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. DPOK/DDPO: TDM-R1 specifically targets *few-step* models (4 steps) vs. many-step standard diffusion, avoiding blurry results common when applying standard RL to few steps.
vs. ReFL: Uses a learned surrogate reward rather than simple weighting, enabling more precise gradient guidance.
vs. Mix-GRPO [not cited in paper]: Mix-GRPO mixes SDE and ODE steps to reduce variance; TDM-R1 relies entirely on deterministic ODE-like TDM trajectories for zero-variance intermediate reward assignment.

Limitations

Relies on the quality of the TDM pre-training; if the base few-step model is poor, RL might struggle.
Computational cost of training a surrogate reward alongside the generator (though inference is fast).
No specific discussion of failure modes for adversarial rewards.

Reproducibility

Code: https://github.com/Luo-Yihong/TDM-R1

The paper provides detailed derivations in the appendix. Pre-trained TDM checkpoints for SD3.5-M and Z-Image are assumed available.

📊 Experiments & Results

Evaluation Setup

Text-to-Image generation with reinforcement learning on specific non-differentiable objectives (Text rendering, Visual quality, Alignment)

Benchmarks:

GenEval (Complex instruction following and spatial reasoning)
DesignBench (Text rendering and visual design)

Metrics:

GenEval Score (Overall accuracy)
Text Rendering Accuracy (OCR-based)
Human Preference / Visual Quality

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GenEval	Overall Score	0.61 (starting point, before TDM-R1)	0.92	+0.31
GenEval	Overall Score	0.63 (80-NFE base model)	0.92	+0.29
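
The Δ column can be reproduced from the reported scores. The sketch below also derives relative gains, which are simple arithmetic on the table values rather than numbers reported in the paper; the baseline labels follow the Evaluation Highlights.

```python
tdm_r1_score = 0.92
baselines = {
    "starting point before TDM-R1": 0.61,
    "80-NFE base model": 0.63,
}

# Map each baseline to (absolute gain, relative gain).
gains = {}
for name, score in baselines.items():
    absolute = tdm_r1_score - score
    gains[name] = (absolute, absolute / score)

for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute:.2f} absolute, {relative:.0%} relative")
```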

Experiment Figures

Qualitative comparison of text rendering. TDM-R1 produces clear, correct text compared to baselines.

Visual comparison on complex prompts involving counts and spatial relationships.

Main Takeaways

TDM-R1 significantly improves instruction following and text rendering capabilities of few-step models.
The method scales effectively to large models (6B parameter Z-Image), showing it is not limited to smaller architectures.
Using deterministic trajectories (TDM) is crucial for accurate reward assignment; stochastic baselines perform worse.
The surrogate reward learning is essential; standard diffusion RL methods (like those treating rewards as weighted losses) fail to produce sharp images in the few-step regime.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (Forward and Reverse processes)
Reinforcement Learning from Human Feedback (RLHF)
Bradley-Terry Model for preferences

Key Terms

TDM: Trajectory Distribution Matching—a few-step generative method that aligns student and teacher trajectories at the distributional level using deterministic sampling.

NFE: Number of Function Evaluations—the number of times the neural network is called to generate a single image; fewer is faster.

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models to maximize a reward model derived from human preferences.

Surrogate Reward: A learned differentiable reward function that approximates the true non-differentiable reward, used to guide the generator's gradients.

EMA: Exponential Moving Average—a technique where model weights are updated as a moving average of past weights to stabilize training.

GenEval: A rigorous benchmark for evaluating text-to-image models on their ability to follow complex prompts and spatial reasoning.

KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution.

ODE: Ordinary Differential Equation—in this context, refers to deterministic sampling paths in diffusion models (Probability Flow ODE).

SDE: Stochastic Differential Equation—sampling paths involving random noise injection.
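
To make the ODE/SDE distinction concrete, here is a toy comparison of a deterministic Euler sampler and a stochastic Euler-Maruyama sampler. The drift function, step size, and noise scale are arbitrary choices for illustration and are not taken from any diffusion model.

```python
import random

def drift(x):
    return -x  # toy drift pulling samples toward 0

def sample_ode(x, steps=4, dt=0.25):
    """Euler steps on dx = drift(x) dt: no randomness after the start point."""
    for _ in range(steps):
        x = x + drift(x) * dt
    return x

def sample_sde(x, rng, steps=4, dt=0.25, noise_scale=0.5):
    """Euler-Maruyama steps: the same drift plus Gaussian noise at every step."""
    for _ in range(steps):
        x = x + drift(x) * dt + noise_scale * (dt ** 0.5) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
ode_runs = {sample_ode(1.0) for _ in range(5)}
sde_runs = {sample_sde(1.0, rng) for _ in range(5)}

print(f"distinct ODE outputs: {len(ode_runs)}, distinct SDE outputs: {len(sde_runs)}")
```

Starting from the same point, the ODE sampler always lands on the same output, whereas the SDE sampler produces a different output on each run.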