SIDiffAgent: Self-Improving Diffusion Agent

📝 Paper Summary

Text-to-Image Generation, Agentic AI Frameworks

SIDiffAgent is a training-free multi-agent framework that optimizes text-to-image generation by refining prompts, performing targeted edits, and learning from past successes and failures via a structured memory.

Core Problem

State-of-the-art diffusion models suffer from an 'intent gap' due to sensitivity to prompt phrasing and a mismatch between user prompts and training captions.

Why it matters:

Ambiguous prompts (e.g., 'mouse') lead to unintended outputs (animal vs. peripheral) because models lack context for user intent
Underspecified prompts force models to make unguided assumptions, increasing computational costs as users must repeatedly regenerate images to get desired results

Concrete Example: A prompt like 'a car on a road' is underspecified, so without the agentic framework the model generates a generic car. With SIDiffAgent, the system infers the missing details (model, color) according to the assigned creativity level. In a separate case, it fixes a specific issue, such as a wall clock failing to show '10:10', by retrieving past failure patterns from memory.

Key Novelty

Theory-of-Mind-Inspired Self-Improving Diffusion Agent

Introduces an experience-driven memory mechanism that records 'pitfalls' and 'successes' at each decision node, allowing the system to retrieve corrective guidance for future similar prompts
Implements a 'Theory of Mind' approach where sub-agents anticipate the behavior and potential failures of other agents (e.g., the orchestrator knowing the generator struggles with specific objects)
Combines generation with a dedicated editing loop (Qwen-Image-Edit) that fixes artifacts based on an evaluator's report without requiring full image regeneration
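The per-node memory described above can be pictured with a small data structure. This is only a minimal sketch under stated assumptions: the names `NodeMemory`, `ExperienceMemory`, `record`, and `guidance_for`, and the `AVOID`/`REUSE` text format, are hypothetical illustrations and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class NodeMemory:
    """Experience recorded at one decision node (hypothetical structure)."""
    pitfalls: list[str] = field(default_factory=list)
    successes: list[str] = field(default_factory=list)


class ExperienceMemory:
    """Maps a decision-node name to the pitfalls and successes seen there."""

    def __init__(self) -> None:
        self._nodes: dict[str, NodeMemory] = defaultdict(NodeMemory)

    def record(self, node: str, note: str, succeeded: bool) -> None:
        # File the note under successes or pitfalls for this node.
        entry = self._nodes[node]
        (entry.successes if succeeded else entry.pitfalls).append(note)

    def guidance_for(self, node: str) -> str:
        # Render stored experience as plain-text guidance (assumed format).
        entry = self._nodes[node]
        lines = [f"AVOID: {p}" for p in entry.pitfalls]
        lines += [f"REUSE: {s}" for s in entry.successes]
        return "\n".join(lines)
```

A node with no recorded experience simply yields empty guidance, so a first-time prompt runs without any memory-derived constraints.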

Architecture

Figure: The overall workflow of SIDiffAgent, illustrating the interaction between the User, Guidance Agent, Generation Orchestrator, Evaluation Agent, and the Knowledge Base.

Evaluation Highlights

+8.73% improvement in VQA Score on GenAIBench compared to the prior agentic system T2I-Copilot
+5.36% improvement over the proprietary Imagen 3 model on GenAIBench
+15.70% improvement over Stable Diffusion 3.5 (SD 3.5) on GenAIBench

Breakthrough Assessment

8/10

Significant because it adds a self-improving memory loop to diffusion agents without any model training or fine-tuning, directly addressing the 'intent gap' in text-to-image generation.

⚙️ Technical Details

Problem Definition

Setting: Text-to-Image (T2I) generation with automated refinement and self-improvement

Inputs: Natural language user prompt

Outputs: Final generated image aligned with user intent and free of artifacts

Pipeline Flow

Guidance Agent (Retrieves memory)
Generation Orchestrator (Preprocesses prompts)
Generation Sub-Agent (Generates image)
Evaluation Agent (Checks quality)
Editing Loop (if needed)
Memory Update (Stores trajectory)
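The six stages above can be sketched as a single control loop. This is an illustrative outline, not the authors' implementation: every callable is a hypothetical stand-in for the corresponding agent, images are represented as opaque strings, and the `max_edits` cap is an assumption (the summary does not say how the editing loop terminates).

```python
from typing import Callable


def run_pipeline(
    prompt: str,
    retrieve_guidance: Callable[[str], str],       # Guidance Agent
    preprocess: Callable[[str, str], str],         # Generation Orchestrator
    generate: Callable[[str], str],                # Generation Sub-Agent
    evaluate: Callable[[str, str], list[str]],     # Evaluation Agent
    edit: Callable[[str, list[str]], str],         # Editing Loop
    store: Callable[[dict], None],                 # Memory Update
    max_edits: int = 2,                            # assumed cap on edit rounds
) -> str:
    guidance = retrieve_guidance(prompt)
    refined = preprocess(prompt, guidance)
    image = generate(refined)

    # Edit only while the evaluator still reports issues.
    issues = evaluate(image, prompt)
    edits = 0
    while issues and edits < max_edits:
        image = edit(image, issues)
        issues = evaluate(image, prompt)
        edits += 1

    # Persist the trajectory so later prompts can learn from it.
    store({
        "prompt": prompt,
        "refined": refined,
        "edits": edits,
        "remaining_issues": issues,
    })
    return image
```

Passing the agents in as callables keeps the sketch independent of any particular generator or editor; in the system described here the editing step would be backed by Qwen-Image-Edit.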

System Modules

Guidance Agent

Retrieves past 'pitfalls' and 'successes' relevant to the current prompt to generate corrective guidance

Model or implementation: Qwen-Embedding (for retrieval)
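Embedding-based retrieval of this kind is commonly implemented as a nearest-neighbour search over stored vectors. The sketch below assumes cosine similarity, a `top_k` limit, and a `min_sim` threshold; none of these details are stated in the summary, and the toy vectors stand in for real Qwen-Embedding outputs, which are not called here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    query_vec: list[float],
    entries: list[tuple[list[float], str]],  # (embedding, stored note)
    top_k: int = 2,        # assumed number of notes to return
    min_sim: float = 0.5,  # assumed relevance threshold
) -> list[str]:
    """Return the notes most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), note) for vec, note in entries]
    relevant = [pair for pair in scored if pair[0] >= min_sim]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [note for _, note in relevant[:top_k]]
```

The threshold keeps unrelated experience out of the guidance: a prompt with no similar history retrieves nothing rather than the least-bad match.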

Creativity Analysis Sub-Agent (Orchestration)

Determines the 'creativity level' (high, medium, low) to constrain how much the system can invent details
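One way to picture this constraint is as a budget on how many inferred details may be appended to the prompt. The sketch below is purely illustrative: the budget values, the assumption that a higher level permits more invention, and the names `Creativity`, `MAX_INVENTED_DETAILS`, and `expand_prompt` are all hypothetical and not taken from the paper.

```python
from enum import Enum


class Creativity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical budgets: how many invented details each level permits.
MAX_INVENTED_DETAILS = {
    Creativity.LOW: 0,
    Creativity.MEDIUM: 2,
    Creativity.HIGH: 5,
}


def expand_prompt(prompt: str, candidate_details: list[str], level: Creativity) -> str:
    """Append at most the permitted number of inferred details to the prompt."""
    allowed = candidate_details[: MAX_INVENTED_DETAILS[level]]
    return ", ".join([prompt, *allowed])
```

Under these assumptions, a low level leaves 'a car on a road' untouched, while a higher level lets the system add details such as a color or model.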