MM-DiTs: Multimodal Diffusion Transformers—generative models that process text and image tokens in a single sequence using joint self-attention (e.g., SD3, Flux)
Joint Self-Attention: An attention mechanism where text and image embeddings are concatenated and processed together, allowing bidirectional interaction but entangling text and image features
IARouter: Instance-adaptive Attention Head Router—the proposed module that activates specific attention heads based on their sensitivity to the edit's target semantics
DTR: Dual-token Refinement—a proposed module to refine image and text tokens to maintain guidance in deep layers where text influence typically wanes
RF-Inversion: Rectified Flow Inversion—a method for inverting diffusion processes in rectified flow models, used as a baseline
UNet: A U-shaped convolutional neural network architecture commonly used in earlier diffusion models (such as Stable Diffusion 1.5/2.0), which inject text conditioning through explicit cross-attention layers
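
To make the Joint Self-Attention entry concrete, here is a minimal single-head sketch in NumPy. It is an illustration under simplifying assumptions, not the implementation used by SD3 or Flux: the function name `joint_self_attention`, the shared projection matrices `w_q`, `w_k`, `w_v`, and the toy dimensions are all hypothetical, and real MM-DiT blocks typically use multiple heads and modality-specific projections. The point is only that text and image tokens share one attention matrix, so every token can attend to every other token in both directions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(text_tokens, image_tokens, w_q, w_k, w_v):
    """Single-head joint self-attention over concatenated text and image tokens.

    Assumption for illustration: both modalities share the same projections.
    """
    n_text = text_tokens.shape[0]
    # Concatenate along the sequence axis: shape (n_text + n_image, d).
    tokens = np.concatenate([text_tokens, image_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # One attention matrix covers text-text, text-image, image-text, image-image.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    out = attn @ v
    # Split the output back into its text and image parts.
    return out[:n_text], out[n_text:], attn


rng = np.random.default_rng(0)
d = 8  # toy embedding width (hypothetical)
text = rng.normal(size=(3, d))   # 3 text tokens
image = rng.normal(size=(5, d))  # 5 image tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

text_out, image_out, attn = joint_self_attention(text, image, w_q, w_k, w_v)
# The off-diagonal blocks hold the cross-modal weights:
text_to_image = attn[:3, 3:]  # how much each text token attends to image tokens
image_to_text = attn[3:, :3]  # how much each image token attends to text tokens
```

The two off-diagonal blocks are what a UNet would compute with a separate cross-attention layer in one direction only; here they fall out of the same softmax as the within-modality weights, which is the source of the feature entanglement the glossary mentions.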