RB-Modulation: Training-Free Personalization of Diffusion Models using Stochastic Optimal Control

Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, C. Caramanis, Sanjay Shakkottai, Wen-Sheng Chu
Google, Google DeepMind
arXiv.org (2024)
MM P13N

📝 Paper Summary

Text-to-Image Personalization, Style Transfer, Diffusion Model Control
RB-Modulation personalizes text-to-image models by treating the reverse diffusion process as a stochastic optimal control problem, modulating the drift to match a target style without training.
Core Problem
Existing training-free personalization methods struggle on three fronts: extracting style accurately (inversion loses information), preventing content leakage from the reference image, and composing style with content flexibly.
Why it matters:
  • Fine-tuning large foundation models is computationally expensive and impractical for single-image references
  • Current training-free methods like StyleAligned lose fine-grained details during DDIM inversion
  • Feature injection methods often cause the content of the style reference to leak into the generated image (e.g., a style reference's object appearing in the output)
Concrete Example: When trying to generate a 'cat' in the style of a specific 'oil painting of a house', prior methods might accidentally generate a house instead of a cat (content leakage) or fail to capture the specific brushstrokes of the painting (poor style extraction).
Key Novelty
Stochastic Optimal Control (SOC) for Drift Modulation
  • Formulates the reverse diffusion dynamics as a control problem where an optimal controller modifies the drift (direction) of the generation process to minimize a terminal cost representing style discrepancy
  • Introduces Attention Feature Aggregation (AFA) to explicitly decouple content and style features within attention layers, preventing the reference image's content from overriding the text prompt
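The drift-modulation idea in the first bullet can be illustrated with a toy simulation. The sketch below is not the paper's algorithm: it assumes a simple linear base drift in place of a pretrained diffusion model, an identity map in place of a style descriptor, and a greedy controller that descends the gradient of the terminal cost at every step. The names `style_cost`, `rollout`, and `gain` are hypothetical and chosen only for illustration.

```python
import math
import random


def style_cost(x, target):
    """Terminal cost: squared distance to the target (toy stand-in for a style discrepancy)."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def rollout(x_init, target, n_steps=50, gain=5.0, noise_scale=0.0, seed=0):
    """Simulate a discretized controlled SDE: x <- x + (drift + control) * dt + noise.

    Assumptions (not from the paper): the base drift is -x, and the control is a
    greedy gradient step on the terminal cost, scaled by `gain`.
    """
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    x = list(x_init)
    for _ in range(n_steps):
        next_x = []
        for xi, ti in zip(x, target):
            base_drift = -xi                    # toy stand-in for the model's own drift
            control = -gain * 2.0 * (xi - ti)   # negative gradient of (xi - ti)^2
            noise = noise_scale * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            next_x.append(xi + (base_drift + control) * dt + noise)
        x = next_x
    return x


if __name__ == "__main__":
    target = [1.0, -1.0]
    uncontrolled = rollout([0.0, 0.0], target, gain=0.0)
    controlled = rollout([0.0, 0.0], target, gain=5.0)
    print("cost without control:", style_cost(uncontrolled, target))
    print("cost with control:   ", style_cost(controlled, target))
```

With `gain=0.0` the state follows the base drift alone, while a positive gain steers the final state toward the target and lowers the terminal cost. No weights are updated at any point, which is the sense in which such a scheme is training-free.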
Architecture
Figure 2: Diagram of the Attention Feature Aggregation (AFA) module, illustrating how reference image features are integrated into the attention mechanism.
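One plausible reading of this kind of decoupling is sketched below. It assumes that attention is computed separately against text-derived and style-derived keys and values using the same queries, and that the two outputs are then blended. The blending rule, the `style_weight` parameter, and all function names are assumptions made for illustration, not details confirmed by this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists (one row per token)."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs


def aggregate_attention(queries, text_kv, style_kv, style_weight=0.5):
    """Attend to text and style features separately, then blend the two outputs.

    Keeping the two softmaxes separate means style tokens never compete with
    text tokens for attention mass (assumed mechanism for limiting leakage).
    """
    text_out = attention(queries, *text_kv)
    style_out = attention(queries, *style_kv)
    return [
        [(1.0 - style_weight) * t + style_weight * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(text_out, style_out)
    ]


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    style_kv = ([[0.0, 1.0]], [[5.0, 5.0]])
    print(aggregate_attention(q, text_kv, style_kv))
```

Setting `style_weight=0.0` recovers text-only attention and `style_weight=1.0` recovers style-only attention, so the parameter makes the trade-off between prompt fidelity and style strength explicit in this sketch.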
Breakthrough Assessment
8/10
Provides a theoretically grounded framework (optimal control) for diffusion personalization that eliminates the need for fine-tuning or adapters, addressing key issues such as content leakage and inversion artifacts.