
David and Goliath: Small One-step Model Beats Large Diffusion with Score Post-training

Weijian Luo, Colin Zhang, Debing Zhang, Zhengyang Geng
International Conference on Machine Learning (2024)
MM RL Benchmark

📝 Paper Summary

Text-to-Image Generation, Generative Model Alignment
Diff-Instruct* aligns one-step image generators using score-based divergence regularization instead of KL divergence, allowing a small 2.6B model to beat large 12B multi-step models in human preference metrics.
Core Problem
Traditional RLHF alignment of one-step models relies on KL-divergence regularization, which is prone to 'reward hacking': models generate unrealistic, painting-like images that maximize reward scores instead of producing photo-realistic content.
Why it matters:
  • One-step generators are crucial for real-time applications on edge devices but often lack the alignment quality of expensive multi-step diffusion models
  • Current alignment methods force a trade-off: high human preference scores often come at the cost of image diversity and visual realism due to the mode-seeking nature of KL divergence
Concrete Example: When optimizing for human rewards with standard KL-regularized RLHF, a model might generate a distorted, oversaturated image that earns a high score from the reward model (reward hacking) but looks to a human like a strange painting rather than the realistic image the prompt asked for.
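For reference, the KL-regularized objective discussed in this section is commonly written in the form below. The notation is assumed for illustration and is not taken from the paper: p_θ is the generator's output distribution, p_ref a reference diffusion model's distribution, r a reward model, and β a regularization weight.

```latex
\max_{\theta}\; \mathbb{E}_{x \sim p_\theta}\big[\, r(x) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\big( p_\theta \,\|\, p_{\mathrm{ref}} \big)
```

In this framing, reward hacking corresponds to the first term dominating: the generator drifts toward images the reward model scores highly, even when they are unrealistic.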
Key Novelty
Diff-Instruct* (DI*)
  • Replaces the traditional Kullback-Leibler (KL) divergence in RLHF with a general score-based divergence (specifically Pseudo-Huber), which preserves distribution diversity better than KL
  • Derives a mathematically equivalent tractable loss function that allows optimizing this intractable score-based objective using gradient descent
  • Incorporates Classifier-Free Guidance (CFG) into the alignment process via an 'implicit reward' term derived from the guidance scale
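To make the Pseudo-Huber choice more concrete, here is a minimal sketch of the standard pseudo-Huber distance applied to the difference between two score vectors. The function names, the constant `c`, and the example vectors are hypothetical; the exact form and weighting used in the paper may differ.

```python
import math


def pseudo_huber(diff, c=1.0):
    """Standard pseudo-Huber distance: sqrt(||diff||^2 + c^2) - c."""
    sq_norm = sum(d * d for d in diff)
    return math.sqrt(sq_norm + c * c) - c


def squared_l2(diff):
    """Plain squared L2 distance, shown only for comparison."""
    return sum(d * d for d in diff)


# Hypothetical score vectors for the generator and the reference model.
score_generator = [0.5, -1.0, 2.0]
score_reference = [0.4, -1.2, 5.0]
diff = [a - b for a, b in zip(score_generator, score_reference)]

print("pseudo-Huber:", pseudo_huber(diff, c=1.0))
print("squared L2:  ", squared_l2(diff))
```

The pseudo-Huber distance behaves quadratically for small differences and roughly linearly for large ones, so a single large score mismatch contributes less than it would under a squared distance.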
Architecture
Figure (Algorithm 1 / Implicit Flow): Conceptual flow of the Diff-Instruct* training loop involving three models.
Evaluation Highlights
  • The 2.6B DI*-SDXL-1step model outperforms the 12B FLUX-dev (50-step) model on ImageReward and PickScore metrics on the Parti prompts benchmark
  • Achieves a state-of-the-art HPSv2.1 score of 31.19 among open-source models
  • Reduces inference latency by ~98%, requiring only 1.88% of the inference time needed by the 50-step FLUX-dev model
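The two latency figures above are the same measurement stated two ways. A small sketch of the arithmetic follows; the helper name `latency_summary` is hypothetical, and the only input taken from the summary is the 1.88% time fraction.

```python
def latency_summary(time_fraction):
    """Convert a relative inference time into a percent reduction and a speedup factor."""
    if not 0.0 < time_fraction <= 1.0:
        raise ValueError("time_fraction must be in (0, 1]")
    reduction_pct = (1.0 - time_fraction) * 100.0
    speedup = 1.0 / time_fraction
    return reduction_pct, speedup


# 1.88% of the 50-step FLUX-dev inference time, as reported in the summary.
reduction_pct, speedup = latency_summary(0.0188)
print(f"latency reduction: {reduction_pct:.2f}%")
print(f"speedup factor:    {speedup:.1f}x")
```

Under this reading, 1.88% of the baseline time corresponds to a reduction of about 98.1%, or a speedup of roughly 53x.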
Breakthrough Assessment
9/10
Demonstrates that algorithmic innovation (score-based alignment) allows small, fast models to beat massive, slow state-of-the-art models in preference metrics. A significant efficiency win.