
SIDiffAgent: Self-Improving Diffusion Agent

Shivank Garg, Ayush Singh, Gaurav Kumar Nayak
Indian Institute of Technology Roorkee
arXiv (2026)
Agent Memory MM RL Benchmark

📝 Paper Summary

Text-to-Image Generation, Agentic AI Frameworks
SIDiffAgent is a training-free multi-agent framework that optimizes text-to-image generation by refining prompts, performing targeted edits, and learning from past successes and failures via a structured memory.
Core Problem
State-of-the-art diffusion models suffer from an 'intent gap' due to sensitivity to prompt phrasing and a mismatch between user prompts and training captions.
Why it matters:
  • Ambiguous prompts (e.g., 'mouse') lead to unintended outputs (animal vs. peripheral) because models lack context for user intent
  • Underspecified prompts force models to make unguided assumptions, increasing computational costs as users must repeatedly regenerate images to get desired results
Concrete Example: The prompt 'a car on a road' is underspecified, so without the agentic framework the model generates a generic car. SIDiffAgent instead infers the missing details (such as model and color) according to the chosen creativity level. It also fixes specific failures, such as a wall clock that does not show '10:10', by retrieving past failure patterns from memory.
Key Novelty
Theory-of-Mind Inspired Self-Improving Diffusion Agent
  • Introduces an experience-driven memory mechanism that records 'pitfalls' and 'successes' at each decision node, allowing the system to retrieve corrective guidance for future similar prompts
  • Implements a 'Theory of Mind' approach where sub-agents anticipate the behavior and potential failures of other agents (e.g., the orchestrator knowing the generator struggles with specific objects)
  • Combines generation with a dedicated editing loop (Qwen-Image-Edit) that fixes artifacts based on an evaluator's report without requiring full image regeneration
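To make the memory idea in the first bullet concrete, here is a minimal sketch of a store that records pitfalls and successes per decision node and retrieves the most similar past entries for a new prompt. Everything in it is an assumption made for illustration: the names `Experience`, `ExperienceMemory`, `record`, and `retrieve`, the node label `"generation"`, and the token-overlap (Jaccard) similarity are hypothetical and are not taken from the paper, which may use a different retrieval method.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experience:
    node: str     # decision node the note belongs to (hypothetical label)
    prompt: str   # prompt that produced the outcome
    outcome: str  # "pitfall" or "success"
    note: str     # corrective guidance or what worked


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, used for a crude similarity score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class ExperienceMemory:
    entries: List[Experience] = field(default_factory=list)

    def record(self, node: str, prompt: str, outcome: str, note: str) -> None:
        if outcome not in ("pitfall", "success"):
            raise ValueError("outcome must be 'pitfall' or 'success'")
        self.entries.append(Experience(node, prompt, outcome, note))

    def retrieve(self, node: str, prompt: str, k: int = 3) -> List[Experience]:
        """Return up to k entries for this node, ranked by Jaccard overlap."""
        query = _tokens(prompt)
        scored = []
        for entry in self.entries:
            if entry.node != node:
                continue
            candidate = _tokens(entry.prompt)
            union = query | candidate
            score = len(query & candidate) / len(union) if union else 0.0
            if score > 0:
                scored.append((score, entry))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored[:k]]


memory = ExperienceMemory()
memory.record("generation", "a wall clock showing 10:10", "pitfall",
              "clock hands rendered at the wrong time; state hand positions explicitly")
memory.record("generation", "a red car on a road", "success",
              "explicit color and setting gave a faithful image")

hits = memory.retrieve("generation", "a wall clock on a brick wall")
for hit in hits:
    print(hit.outcome, "->", hit.note)
```

A real system would more likely rank entries with embeddings than with token overlap; the overlap score is used here only to keep the sketch dependency-free.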
Architecture
Figure 1: The overall workflow of SIDiffAgent, illustrating the interaction between the User, Guidance Agent, Generation Orchestrator, Evaluation Agent, and the Knowledge Base.
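The control flow this workflow implies (refine the prompt, generate, evaluate, then edit rather than regenerate) can be sketched roughly as below. This is an illustrative outline under stated assumptions, not the authors' implementation: `run_pipeline`, `Report`, the `max_edits` cap, and the four stub callables are hypothetical, and images are represented as plain strings so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Report:
    passed: bool       # whether the evaluator accepts the image
    issues: List[str]  # problems the editor should fix


def run_pipeline(
    prompt: str,
    refine: Callable[[str], str],
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Report],
    edit: Callable[[str, List[str]], str],
    max_edits: int = 2,  # assumed cap on editing rounds
) -> Tuple[str, Report, int]:
    refined = refine(prompt)
    image = generate(refined)
    report = evaluate(image, refined)
    edits = 0
    # Apply targeted edits instead of regenerating the whole image.
    while not report.passed and edits < max_edits:
        image = edit(image, report.issues)
        report = evaluate(image, refined)
        edits += 1
    return image, report, edits


# Stub agents standing in for the real models (all hypothetical).
def refine(prompt: str) -> str:
    return prompt + ", photorealistic"


def generate(prompt: str) -> str:
    return f"image({prompt})"


def evaluate(image: str, prompt: str) -> Report:
    ok = "fixed" in image
    return Report(passed=ok, issues=[] if ok else ["clock hands show wrong time"])


def edit(image: str, issues: List[str]) -> str:
    return image + "+fixed"


image, report, edits = run_pipeline(
    "a wall clock showing 10:10", refine, generate, evaluate, edit
)
print(image, report.passed, edits)
```

Passing the agents in as callables keeps the loop independent of any particular generator or editor, so the stubs could be swapped for real model calls without changing the control flow.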
Evaluation Highlights
  • +8.73% improvement in VQA Score on GenAIBench compared to the prior agentic system T2I-Copilot
  • +5.36% improvement over the proprietary Imagen 3 model on GenAIBench
  • +15.70% improvement over Stable Diffusion 3.5 (SD 3.5) on GenAIBench
Breakthrough Assessment
8/10
Significant because it introduces a self-improving memory loop to diffusion agents without requiring any model training or fine-tuning, directly addressing the 'intent gap' in generative AI.