Evaluation Setup
Qualitative and quantitative analysis of editing precision on standard prompts
Benchmarks:
- PARTI prompts (Text-to-image generation prompts)
Metrics:
- BCE (binary cross-entropy) against ground-truth (GT) masks
- Soft mIoU (Mean Intersection over Union)
- MSE (Mean Squared Error)
- Inference Speed (Seconds)
- Statistical methodology: Average ranking over 100 random prompts
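
The three mask-quality metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's evaluation code: it assumes the predicted map has already been normalized to [0, 1], that all three metrics are computed against the same binary mask (the source states this only for BCE), and that soft IoU uses the elementwise min/max formulation (other soft variants exist). The function names are hypothetical.

```python
import numpy as np

def bce(pred, mask, eps=1e-7):
    """Mean binary cross-entropy between a soft map in [0, 1] and a binary mask."""
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(mask * np.log(p) + (1.0 - mask) * np.log(1.0 - p)))

def soft_iou(pred, mask, eps=1e-7):
    """Soft IoU (assumed form): sum of elementwise min over sum of elementwise max."""
    inter = np.minimum(pred, mask).sum()
    union = np.maximum(pred, mask).sum()
    return float(inter / (union + eps))

def soft_miou(preds, masks):
    """Mean soft IoU over a collection of (map, mask) pairs."""
    return float(np.mean([soft_iou(p, m) for p, m in zip(preds, masks)]))

def mse(pred, mask):
    """Mean squared error between the soft map and the mask."""
    return float(np.mean((pred - mask) ** 2))

# Toy example: a 2x2 mask and a slightly blurry prediction of it.
mask = np.array([[1.0, 0.0], [0.0, 1.0]])
pred = np.array([[0.9, 0.2], [0.1, 0.8]])
print(bce(pred, mask), soft_iou(pred, mask), mse(pred, mask))
```

Lower BCE and MSE and higher soft mIoU indicate a map that agrees more closely with the mask.
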
Key Results
Computational efficiency comparison showing the benefit of input projection (q, k) replacement over attention map replacement.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Inference Time (SD3-M) | Seconds | 45.6 | 15.2 | -30.4 |
| Inference Time (Flux.1-dev) | Seconds | 153.2 | 55.9 | -97.3 |
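
The absolute differences in the Δ column can also be read as speedup factors. The short sketch below derives them from the reported timings only, taking the Baseline column as attention map replacement and the This Paper column as input projection replacement, as the caption suggests. The helper names are illustrative.

```python
# Reported inference times in seconds: (baseline, this paper).
timings = {
    "SD3-M": (45.6, 15.2),
    "Flux.1-dev": (153.2, 55.9),
}

def speedup(baseline_s, ours_s):
    """How many times faster the new method is than the baseline."""
    return baseline_s / ours_s

def relative_reduction(baseline_s, ours_s):
    """Fraction of the baseline time that is saved."""
    return (baseline_s - ours_s) / baseline_s

for model, (base, ours) in timings.items():
    print(f"{model}: {speedup(base, ours):.2f}x faster, "
          f"{relative_reduction(base, ours):.1%} less time")
```

This works out to roughly a 3.0x speedup on SD3-M and about 2.7x on Flux.1-dev.
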
Main Takeaways
- The image-to-image (I2I) blocks of MM-DiT attention function like U-Net self-attention (they carry structure), while the text-to-image (T2I) blocks function like cross-attention (they carry semantics).
- Larger MM-DiT models exhibit noisy attention maps (aligned with ViT scaling laws), necessitating the selection of specific 'clean' blocks for editing masks.
- Replacing the input projections (q_i, k_i) is mathematically similar to replacing the I2I attention map, but it is much faster (15.2 s vs. 45.6 s on SD3-M, 55.9 s vs. 153.2 s on Flux.1-dev) and avoids T5 text embedding misalignment.
- Injecting source information into *all* blocks prevents editing in few-step distilled models; partial injection is required for Flux.1-schnell/SD3.5-Turbo.
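
The relationship between input projection replacement and I2I attention replacement, and the idea of partial injection, can be illustrated with a toy single-head joint attention in NumPy. This is a sketch under stated assumptions, not the paper's implementation: q_i and k_i are taken to be the image-token query and key projections, text and image tokens are concatenated as [text; image], the projections are random stand-ins, and `inject_blocks` is a hypothetical choice of which blocks receive source information.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_logits(q_t, k_t, q_i, k_i):
    """Scaled dot-product logits over concatenated [text; image] tokens."""
    q = np.concatenate([q_t, q_i], axis=0)
    k = np.concatenate([k_t, k_i], axis=0)
    return q @ k.T / np.sqrt(q.shape[-1])

def random_projections(rng, n_text, n_img, d):
    """Stand-in for one block's q/k projections (random, for illustration)."""
    return {
        "q_t": rng.normal(size=(n_text, d)),
        "k_t": rng.normal(size=(n_text, d)),
        "q_i": rng.normal(size=(n_img, d)),
        "k_i": rng.normal(size=(n_img, d)),
    }

def edited_logits(src, tgt, inject):
    """Target-pass logits, optionally reusing the source image q and k."""
    q_i, k_i = (src["q_i"], src["k_i"]) if inject else (tgt["q_i"], tgt["k_i"])
    return joint_logits(tgt["q_t"], tgt["k_t"], q_i, k_i)

rng = np.random.default_rng(0)
n_text, n_img, d, n_blocks = 4, 6, 8, 3
src_blocks = [random_projections(rng, n_text, n_img, d) for _ in range(n_blocks)]
tgt_blocks = [random_projections(rng, n_text, n_img, d) for _ in range(n_blocks)]
inject_blocks = {0, 1}  # hypothetical: partial injection, block 2 is left free

# Image-query rows and image-key columns of the joint attention matrix.
i2i = (slice(n_text, None), slice(n_text, None))
for b, (src, tgt) in enumerate(zip(src_blocks, tgt_blocks)):
    src_l = joint_logits(**src)
    edit_l = edited_logits(src, tgt, inject=b in inject_blocks)
    same_i2i = np.allclose(edit_l[i2i], src_l[i2i])
    print(f"block {b}: I2I logits match source = {same_i2i}")
```

In the injected blocks the I2I logits equal the source's exactly, without the attention map ever being edited directly. Because the row-wise softmax also spans the target's text keys, the resulting I2I probabilities are close to, but not identical to, the source's, which is one way to read "mathematically similar". Blocks outside `inject_blocks` keep the target's own projections and remain free to change.
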