GoT introduces a paradigm where an MLLM first generates a semantic-spatial reasoning chain explaining the scene layout, which then guides a diffusion model via a specialized guidance module to produce images.
Core Problem
Current diffusion models map text directly to pixels without explicit reasoning, struggling with complex scenes requiring precise spatial arrangements and object interactions that humans naturally plan.
Why it matters:
Existing methods treat text prompts as static representations, failing to capture the step-by-step logic required for complex compositions
There is a disconnect between the advanced reasoning capabilities of MLLMs and the lack of reasoning in visual generation systems
Previous layout-based methods treat planning and generation as separate stages rather than an integrated end-to-end reasoning process
Concrete Example: When asked to 'replace the giant leaf with an umbrella', standard models might just swap pixels based on keywords. GoT first reasons: 'analyze scene -> plan edit at specific coordinates -> describe final state', ensuring the umbrella is correctly grounded and the scene remains coherent.
Key Novelty
Generation Chain-of-Thought (GoT)
Shifts visual generation from direct mapping to a reasoning-guided process where the model outputs a natural language plan with spatial coordinates before generating pixels
Uses a novel Semantic-Spatial Guidance Module (SSGM) to inject the MLLM's reasoning chain directly into the diffusion process via semantic embeddings and spatial masks
Unifies generation and editing in one framework by treating editing as a reasoning task involving reference image analysis and modification planning
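To make the "spatial masks" idea concrete, the sketch below rasterizes bounding boxes from a reasoning chain into a binary mask. It is illustrative only and is not the paper's SSGM implementation: the function name `boxes_to_mask`, the (x1, y1, x2, y2) pixel convention, and the plain nested-list mask are assumptions. In the actual system the masks are encoded into latent space before guiding diffusion.

```python
from typing import List, Tuple

# Assumed convention (not from the paper): pixel coordinates, x2/y2 exclusive.
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def boxes_to_mask(boxes: List[Box], width: int, height: int) -> List[List[int]]:
    """Rasterize boxes into a binary mask: 1 inside any box, 0 elsewhere."""
    mask = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Order the corners and clamp to the canvas so bad input cannot raise.
        left, right = max(0, min(x1, x2)), min(width, max(x1, x2))
        top, bottom = max(0, min(y1, y2)), min(height, max(y1, y2))
        for y in range(top, bottom):
            for x in range(left, right):
                mask[y][x] = 1
    return mask


if __name__ == "__main__":
    # A 2x2 region inside a 4x4 canvas.
    for row in boxes_to_mask([(1, 1, 3, 3)], width=4, height=4):
        print(row)
```

A binary mask is the simplest choice here; a real guidance module could equally use soft or per-object masks.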
Architecture
The overall GoT framework architecture, illustrating the flow from input prompt to MLLM reasoning, to the Semantic-Spatial Guidance Module, and finally to the Diffusion Model.
Breakthrough Assessment
8/10
Proposes a significant architectural shift by integrating explicit MLLM reasoning chains into the diffusion loop end-to-end, supported by a massive new dataset (9M+ samples).
⚙️ Technical Details
Problem Definition
Setting: Text-to-Image Generation and Instruction-based Image Editing with explicit intermediate reasoning steps
Inputs: Textual prompt (and Reference Image for editing tasks)
Outputs: Reasoning chain (text + coordinates) followed by the Generated/Edited Image
vs. ORA [not cited in paper]: ORA also uses progressive reasoning for generation, but GoT specifically formulates a unified semantic-spatial chain with coordinate grounding
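The output format (text plus coordinates) can be illustrated with a small parser. This is a minimal sketch under a stated assumption: boxes are taken to appear inline as `(x1,y1),(x2,y2)` after the phrase they ground. The summary does not specify the exact serialization, so the pattern and the names `parse_chain` and `ParsedChain` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical inline box format: "(x1,y1),(x2,y2)". An assumption, not the paper's spec.
BOX_PATTERN = re.compile(r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)")


@dataclass
class ParsedChain:
    text: str                               # chain with the coordinates stripped
    boxes: List[Tuple[int, int, int, int]]  # (x1, y1, x2, y2), in order of appearance


def parse_chain(chain: str) -> ParsedChain:
    """Split a reasoning chain into its semantic text and its boxes."""
    boxes = [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(chain)]
    # Remove each box together with the whitespace that precedes it.
    text = re.sub(r"\s*" + BOX_PATTERN.pattern, "", chain).strip()
    return ParsedChain(text=text, boxes=boxes)


if __name__ == "__main__":
    parsed = parse_chain("A dog (10,20),(110,220) sits beside a ball (300,400),(350,450).")
    print(parsed.text)
    print(parsed.boxes)
```

Separating the two parts mirrors the semantic-spatial split described above: the text carries the semantics and the boxes carry the layout.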
Limitations
Heavy computational cost for data construction (100 A100s for >1 month)
Relies on the capability of the upstream MLLM (Qwen2.5-VL) for accurate reasoning and grounding
Inference latency is likely higher due to the sequential generation of the reasoning chain before image synthesis (implied by design)
Code, datasets, and pretrained models are publicly available at https://github.com/rongyaofang/GoT. The data creation pipeline uses Qwen2-VL and Qwen2.5. Data generation required significant compute (100 A100s for over a month).
📊 Experiments & Results
Evaluation Setup
Text-to-Image Generation and Instruction-driven Image Editing
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
The automated data creation pipeline for constructing the GoT datasets using Qwen2-VL and Qwen2.5.
Main Takeaways
The paper constructs the first large-scale reasoning chain dataset for visual generation, comprising over 9 million samples across generation and editing tasks.
The unified framework successfully handles both text-to-image generation and complex image editing (single and multi-turn) within a single architecture.
Qualitative examples demonstrate that explicit spatial reasoning allows for precise object placement and complex manipulation (e.g., swapping objects while maintaining scene coherence) that baseline diffusion models struggle with.
The system supports interactive generation, allowing users to modify the intermediate reasoning chain (text or coordinates) to precisely control the final image output.
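Interactive control amounts to editing the chain before it reaches the diffusion stage. The sketch below shows one way a user-side tool might translate a single box within the chain text. It again assumes a hypothetical inline `(x1,y1),(x2,y2)` box format, and `shift_box` is an illustrative helper, not part of the released code.

```python
import re

# Hypothetical inline box format: "(x1,y1),(x2,y2)". An assumption, not the paper's spec.
BOX_PATTERN = re.compile(r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)")


def shift_box(chain: str, index: int, dx: int, dy: int) -> str:
    """Return the chain with the index-th box (0-based) translated by (dx, dy)."""
    seen = -1

    def replace(match: "re.Match[str]") -> str:
        nonlocal seen
        seen += 1
        if seen != index:
            return match.group(0)  # leave every other box untouched
        x1, y1, x2, y2 = (int(v) for v in match.groups())
        return f"({x1 + dx},{y1 + dy}),({x2 + dx},{y2 + dy})"

    return BOX_PATTERN.sub(replace, chain)


if __name__ == "__main__":
    chain = "umbrella (100,50),(200,150) over a bench (0,0),(10,10)"
    # Move the umbrella 20 px right and 10 px up; the bench stays in place.
    print(shift_box(chain, index=0, dx=20, dy=-10))
```

The edited chain would then be passed to the generation stage in place of the model's original plan.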
📚 Prerequisite Knowledge
Prerequisites
Diffusion Models (Latent Diffusion/SDXL)
Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) Reasoning
Low-Rank Adaptation (LoRA)
Key Terms
GoT: Generation Chain-of-Thought—a paradigm where models output step-by-step reasoning and spatial plans before generating images
SSGM: Semantic-Spatial Guidance Module—a component that converts MLLM reasoning chains into embeddings and spatial masks to guide the diffusion model
MLLM: Multimodal Large Language Model—AI models capable of processing and generating both text and visual data
SDXL: Stable Diffusion XL—a large-scale latent diffusion model used as the generation backbone
VAE: Variational Autoencoder—used here to encode spatial masks and reference images into the latent space for diffusion guidance
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique used here to update the MLLM decoder