
HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads

Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Xiaoyu Kong, Jintao Li, Oliver Deussen, Tong-Yee Lee
arXiv (2024)
MM Benchmark

📝 Paper Summary

Multimodal Diffusion Transformers (MM-DiTs) · Text-guided Image Editing
HeadRouter enables precise image editing in Multimodal Diffusion Transformers by identifying and routing text guidance specifically to attention heads that are sensitive to the target editing semantics.
Core Problem
Multimodal Diffusion Transformers (MM-DiTs) apply joint self-attention over a single sequence of text and image tokens, which entangles text and image features. UNet-based models, by contrast, use distinct cross-attention maps that localize text guidance.
Why it matters:
  • Because MM-DiTs (like SD3 and Flux) lack explicit cross-attention maps, text-prompted edits to specific regions or attributes can become semantically misaligned
  • Existing editing methods designed for UNets (like Prompt-to-Prompt) rely on architecture-specific features that do not exist in the joint attention blocks of MM-DiTs
Concrete Example: In a UNet, changing 'cat' to 'dog' uses a cross-attention map to localize the edit. In an MM-DiT, the text and image tokens are mixed in a single sequence. Without explicit guidance maps, a naive edit might fail to change the animal or might corrupt the background, because the model cannot isolate the influence of the 'animal' semantics.
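The contrast between the two attention layouts can be sketched in a few lines of NumPy. This is a toy illustration only, not code from the paper: the helper names (`cross_attention_map`, `joint_attention_map`), the token counts, and the random features are all assumptions, and the learned query/key projections of a real model are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_map(img_q, txt_k):
    # UNet-style cross-attention: image tokens query text tokens only.
    # Result has shape (n_img, n_txt); each column is a spatial map for one word.
    d = img_q.shape[-1]
    return softmax(img_q @ txt_k.T / np.sqrt(d))

def joint_attention_map(txt, img):
    # MM-DiT-style joint attention: text and image tokens share one sequence,
    # and one softmax runs over all of them together.
    seq = np.concatenate([txt, img], axis=0)
    d = seq.shape[-1]
    return softmax(seq @ seq.T / np.sqrt(d))

rng = np.random.default_rng(0)
n_txt, n_img, d = 4, 16, 8          # hypothetical sizes
txt = rng.normal(size=(n_txt, d))   # stand-in text features
img = rng.normal(size=(n_img, d))   # stand-in image features

cross = cross_attention_map(img, txt)   # (16, 4), rows sum to 1
joint = joint_attention_map(txt, img)   # (20, 20)

# The image-to-text block exists inside the joint map, but its rows no longer
# sum to 1: attention mass is shared with image-to-image entries.
img_to_txt = joint[n_txt:, :n_txt]      # (16, 4)
print(cross.shape, joint.shape, img_to_txt.sum(axis=1).max())
```

In the cross-attention case each row is a full distribution over words, so a column can be read directly as a per-word localization map. In the joint case the comparable block is only one slice of a larger, jointly normalized matrix, which is one way to see why UNet-era map-based editing does not transfer directly.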
Key Novelty
Instance-adaptive Attention Head Routing
  • Discovers, via sensitivity analysis, that different attention heads in MM-DiTs are naturally specialized for different semantics (e.g., color vs. shape)
  • Dynamically identifies these sensitive heads during inference and routes the text guidance specifically to them, amplifying the edit signal where it matters most
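The two steps above (score each head's sensitivity, then route guidance toward the sensitive heads) can be sketched as follows. The summary does not give the paper's actual formulas, so everything here is an assumed stand-in: sensitivity is taken as one minus the cosine similarity between a head's outputs under the source and target prompts, and routing is a softmax reweighting of the per-head change. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def head_sensitivity(out_src, out_tgt):
    # out_src, out_tgt: (heads, tokens, dim) per-head attention outputs
    # under the source and target prompts (assumed inputs).
    num = (out_src * out_tgt).sum(axis=-1)
    den = np.linalg.norm(out_src, axis=-1) * np.linalg.norm(out_tgt, axis=-1) + 1e-8
    cos = num / den                      # (heads, tokens)
    return 1.0 - cos.mean(axis=-1)       # (heads,), larger = more sensitive

def routing_weights(sens, temperature=1.0):
    # Softmax over heads, rescaled so the mean weight is 1.
    z = sens / temperature
    z = z - z.max()
    w = np.exp(z)
    w = w / w.sum()
    return w * len(sens)

def route(out_src, out_tgt, weights):
    # Scale each head's prompt-induced change by its routing weight.
    delta = out_tgt - out_src
    return out_src + weights[:, None, None] * delta

rng = np.random.default_rng(0)
heads, tokens, dim = 4, 6, 8             # hypothetical sizes
out_src = rng.normal(size=(heads, tokens, dim))
out_tgt = out_src.copy()
out_tgt[2] = -out_src[2]                 # only head 2 reacts to the new prompt

sens = head_sensitivity(out_src, out_tgt)
weights = routing_weights(sens)
routed = route(out_src, out_tgt, weights)
print(sens.argmax(), weights.round(3))
```

In this toy setup only head 2 changes between prompts, so it receives the largest weight and its edit signal is amplified, while heads that did not react are left untouched. Because the scores are computed from the outputs of the current sample, the weights differ per input, which is the sense in which such a scheme would be instance-adaptive.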
Breakthrough Assessment
7/10
Addresses a critical architectural gap in the shift from UNets to DiTs for editing. The insight about head specialization in MM-DiTs is valuable, though the snippet lacks quantitative proof of the performance leap.