GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

📝 Paper Summary

Text-to-Image Generation, Instruction-based Image Editing, Chain-of-Thought Reasoning

GoT introduces a paradigm where an MLLM first generates a semantic-spatial reasoning chain explaining the scene layout, which then guides a diffusion model via a specialized guidance module to produce images.

Core Problem

Current diffusion models map text directly to pixels without explicit reasoning, struggling with complex scenes requiring precise spatial arrangements and object interactions that humans naturally plan.

Why it matters:

Existing methods treat text prompts as static representations, failing to capture the step-by-step logic required for complex compositions
There is a disconnect between the advanced reasoning capabilities of MLLMs and the lack of reasoning in visual generation systems
Previous layout-based methods treat planning and generation as separate stages rather than an integrated end-to-end reasoning process

Concrete Example: When asked to 'replace the giant leaf with an umbrella', standard models might simply swap pixels based on keywords. GoT first reasons: 'analyze scene -> plan edit at specific coordinates -> describe final state', so that the umbrella is grounded at the right location and the rest of the scene stays coherent.

Key Novelty

Generation Chain-of-Thought (GoT)

Shifts visual generation from direct mapping to a reasoning-guided process where the model outputs a natural language plan with spatial coordinates before generating pixels
Uses a novel Semantic-Spatial Guidance Module (SSGM) to inject the MLLM's reasoning chain directly into the diffusion process via semantic embeddings and spatial masks
Unifies generation and editing in one framework by treating editing as a reasoning task involving reference image analysis and modification planning

Architecture

The overall GoT framework architecture, illustrating the flow from input prompt to MLLM reasoning, to the Semantic-Spatial Guidance Module, and finally to the Diffusion Model.

Breakthrough Assessment

8/10

Proposes a significant architectural shift by integrating explicit MLLM reasoning chains into the diffusion loop end-to-end, supported by a new large-scale dataset of over 9M samples.

⚙️ Technical Details

Problem Definition

Setting: Text-to-Image Generation and Instruction-based Image Editing with explicit reasoning intermediate steps

Inputs: Textual prompt (and Reference Image for editing tasks)

Outputs: Reasoning chain (text + coordinates) followed by the Generated/Edited Image

Pipeline Flow

Input Processing (Text/Ref Image) -> MLLM Reasoning (Qwen2.5-VL)
Reasoning Chain Generation (Text + Coords)
Guidance Extraction (Semantic Embeddings + Spatial Masks)
SSGM Diffusion (SDXL) -> Final Image
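
The flow above can be read as a plain function composition. The following is a minimal sketch of that data flow only: every name in it (ReasoningChain, Guidance, run_pipeline, and the toy_* stand-ins) is hypothetical and is not taken from the released GoT code, and the stand-ins return placeholder values instead of calling Qwen2.5-VL or SDXL.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in the [0, 1000) coordinate range


@dataclass
class ReasoningChain:
    text: str         # natural-language plan produced by the MLLM
    boxes: List[Box]  # coordinates mentioned in the plan


@dataclass
class Guidance:
    semantic: List[float]            # placeholder for semantic embeddings
    spatial_boxes: List[Box]         # placeholder for spatial mask latents
    reference: Optional[str] = None  # placeholder for reference-image latents (editing only)


def run_pipeline(
    prompt: str,
    reason: Callable[[str, Optional[str]], ReasoningChain],
    extract: Callable[[ReasoningChain, Optional[str]], Guidance],
    diffuse: Callable[[Guidance], str],
    reference_image: Optional[str] = None,
) -> Tuple[ReasoningChain, str]:
    """Reasoning first, then guidance extraction, then image synthesis."""
    chain = reason(prompt, reference_image)
    guidance = extract(chain, reference_image)
    image = diffuse(guidance)
    return chain, image


# Toy stand-ins (assumptions) so the sketch runs without any model weights.
def toy_reason(prompt: str, ref: Optional[str]) -> ReasoningChain:
    return ReasoningChain(text=f"Plan for: {prompt}", boxes=[(100, 200, 500, 800)])


def toy_extract(chain: ReasoningChain, ref: Optional[str]) -> Guidance:
    return Guidance(semantic=[float(len(chain.text))], spatial_boxes=chain.boxes, reference=ref)


def toy_diffuse(guidance: Guidance) -> str:
    return f"<image guided by {len(guidance.spatial_boxes)} box(es)>"


chain, image = run_pipeline("a cat on a sofa", toy_reason, toy_extract, toy_diffuse)
```

Passing the three stages as callables keeps the ordering constraint visible: the diffusion step never sees the raw prompt, only the guidance derived from the reasoning chain.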

System Modules

Reasoning Engine

Generate step-by-step reasoning chains including object attributes, relationships, and bounding box coordinates

Model or implementation: Qwen2.5-VL-3B

Spatial Guidance Encoder

Convert explicit coordinates from the reasoning chain into spatial latent features

Model or implementation: Color-coded Mask Generator + VAE Encoder
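
To make the color-coded mask step concrete, here is a rough sketch that rasterizes boxes from the [0, 1000) coordinate range onto a small RGB grid, one color per box. The palette, the grid size, and the render_mask name are arbitrary choices for illustration; the subsequent VAE encoding of the mask is omitted.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]
Color = Tuple[int, int, int]

# Arbitrary illustrative palette; the actual color assignment is not specified here.
PALETTE: List[Color] = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def render_mask(boxes: List[Box], size: int = 8, coord_max: int = 1000) -> List[List[Color]]:
    """Paint each box onto a size x size grid; later boxes overwrite earlier ones."""
    mask: List[List[Color]] = [[(0, 0, 0) for _ in range(size)] for _ in range(size)]
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        color = PALETTE[i % len(PALETTE)]
        # Floor the top-left corner and ceil the bottom-right corner so a box
        # always covers at least the pixels it touches.
        c1, r1 = x1 * size // coord_max, y1 * size // coord_max
        c2 = min(size, max(c1 + 1, ceil_div(x2 * size, coord_max)))
        r2 = min(size, max(r1 + 1, ceil_div(y2 * size, coord_max)))
        for r in range(r1, r2):
            for c in range(c1, c2):
                mask[r][c] = color
    return mask


mask = render_mask([(0, 0, 500, 500), (500, 500, 999, 999)])
```

With these two boxes on an 8x8 grid, the first box fills the top-left 4x4 block and the second fills the bottom-right 4x4 block, leaving the other two quadrants black.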

Diffusion Generator

Synthesize the final image conditioned on reasoning guidance

Model or implementation: SDXL-based Diffusion Model with Semantic-Spatial Guidance Module (SSGM)

Novel Architectural Elements

Semantic-Spatial Guidance Module (SSGM) integrating three distinct guidance pathways (semantic embeddings from MLLM, spatial latents from parsed coords, reference image latents)
End-to-end differentiability where diffusion gradients backpropagate to the MLLM via the semantic guidance embeddings (G_t)

Modeling

Base Model: Qwen2.5-VL-3B (Reasoning) + SDXL (Generation)

Training Method: End-to-end joint optimization

Objective Functions:

Purpose: Train the MLLM to generate correct reasoning chains.

Formally: Cross-entropy loss on GoT reasoning tokens

Purpose: Train the diffusion model to generate images matching the reasoning.

Formally: Mean Squared Error (MSE) loss on noise prediction
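
The two objectives can be combined into one scalar for joint optimization. The sketch below uses plain Python lists in place of tensors and applies the equal 1.0 weighting listed under Key Hyperparameters; the function names and the toy inputs are assumptions made for illustration.

```python
import math
from typing import Sequence


def token_cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target token at each position."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def noise_mse(pred_noise: Sequence[float], true_noise: Sequence[float]) -> float:
    """Mean squared error between predicted and sampled noise."""
    return sum((p - q) ** 2 for p, q in zip(pred_noise, true_noise)) / len(true_noise)


def joint_loss(logits, targets, pred_noise, true_noise, w_text: float = 1.0, w_diff: float = 1.0) -> float:
    """Weighted sum of the MLLM token loss and the diffusion noise-prediction loss."""
    return w_text * token_cross_entropy(logits, targets) + w_diff * noise_mse(pred_noise, true_noise)


# Toy inputs: one token position over a 2-word vocabulary, two noise values.
loss = joint_loss([[0.0, 0.0]], [0], [1.0, 2.0], [1.0, 0.0])
```

In this toy case the token loss is ln 2 (uniform logits over two tokens) and the noise loss is 2.0, so the joint loss is their sum.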

Adaptation: LoRA (Low-Rank Adaptation) for Qwen2.5-VL decoder; Full fine-tuning for SDXL
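
As a reminder of what LoRA changes, the sketch below computes a layer output as y = W x + (alpha / r) * B (A x), where W stays frozen and only the low-rank factors A and B would be trained. The rank, alpha, and matrix values are toy assumptions; the summary does not state the LoRA settings used for Qwen2.5-VL.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_forward(x: List[float], w: Matrix, a: Matrix, b: Matrix, alpha: float) -> List[float]:
    """y = W x + (alpha / r) * B (A x), with A of shape (r, d_in) and B of shape (d_out, r)."""
    rank = len(a)
    column = [[v] for v in x]
    base = matmul(w, column)                  # frozen pretrained path
    delta = matmul(b, matmul(a, column))      # trainable low-rank path
    scale = alpha / rank
    return [base[i][0] + scale * delta[i][0] for i in range(len(w))]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity, for readability)
a = [[1.0, 1.0]]               # rank-1 down-projection
b_zero = [[0.0], [0.0]]        # B starts at zero, so the layer is initially unchanged
b_trained = [[0.5], [0.0]]     # a hypothetical value of B after some training

y_initial = lora_forward([2.0, 3.0], w, a, b_zero, alpha=1.0)
y_adapted = lora_forward([2.0, 3.0], w, a, b_trained, alpha=1.0)
```

Initializing B to zero is the usual LoRA convention: the adapted model starts out identical to the pretrained one and departs from it only as B is updated.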

Training Data:

Text-to-Image: 8.4M samples (LAHR, JourneyDB, FLUX-generated)
Image Editing: 920K samples (OmniEdit, SEED-Edit-Multiturn)
Data were created using a pipeline of Qwen2-VL (description/grounding) and Qwen2.5 (entity extraction/reasoning synthesis)

Key Hyperparameters:

pretraining_steps: 60,000
finetuning_steps: 10,000
loss_weighting: 1.0 (equal weighting for MLLM token loss and Diffusion MSE)
conditioning_dropout: 5% (for classifier-free guidance)
coordinate_range: [0, 1000)
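
The 5% conditioning dropout exists so that the same network also learns an unconditional prediction, which classifier-free guidance needs at sampling time. The sketch below shows the generic mechanism under simplifying assumptions: conditions are plain lists, the null condition is a zero vector, and cfg_combine is the standard single-condition guidance formula, which may differ from the exact guidance scheme used in the paper.

```python
import random
from typing import List, Sequence


def maybe_drop_condition(
    cond: Sequence[float], null_cond: Sequence[float], p_drop: float, rng: random.Random
) -> List[float]:
    """Training time: replace the condition with the null condition with probability p_drop."""
    return list(null_cond) if rng.random() < p_drop else list(cond)


def cfg_combine(eps_uncond: Sequence[float], eps_cond: Sequence[float], scale: float) -> List[float]:
    """Sampling time: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
cond, null = [1.0, 2.0], [0.0, 0.0]

# Count how often the condition is dropped over many simulated training steps.
dropped = sum(maybe_drop_condition(cond, null, 0.05, rng) == null for _ in range(10_000))

# Guided noise prediction from toy conditional and unconditional outputs.
guided = cfg_combine([0.0, 1.0], [1.0, 1.0], scale=2.0)
```

Over 10,000 simulated steps the drop count should land near 500, and the guided prediction moves past the conditional output along the direction in which the two predictions differ.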

Compute: 100 NVIDIA A100 GPUs for over a month (for data creation)

Comparison to Prior Work

vs. GLIGEN: GoT generates the layout itself via reasoning chains rather than requiring it as input
vs. LayoutGPT: GoT integrates reasoning and generation end-to-end rather than treating them as disjoint stages
vs. SmartEdit: GoT incorporates explicit spatial reasoning (coordinates) alongside semantic understanding
vs. ORA [not cited in paper]: ORA also uses progressive reasoning for generation, but GoT specifically formulates a unified semantic-spatial chain with coordinate grounding

Limitations

Heavy computational cost for data construction (100 A100s for >1 month)
Relies on the capability of the upstream MLLM (Qwen2.5-VL) for accurate reasoning and grounding
Inference latency is likely higher due to the sequential generation of the reasoning chain before image synthesis (implied by design)

Reproducibility

Code: https://github.com/rongyaofang/GoT

Code, datasets, and pretrained models are publicly available at the repository above. The data creation pipeline uses Qwen2-VL and Qwen2.5; data construction alone required significant compute (100 A100 GPUs for over a month).

📊 Experiments & Results

Evaluation Setup

Text-to-Image Generation and Instruction-driven Image Editing

Benchmarks:

LAHR-GoT (Text-to-Image Generation) [New]
OmniEdit-GoT (Single-turn Image Editing) [New]
SEED-Edit-Multiturn-GoT (Multi-turn Image Editing) [New]

Metrics:

Not reported in the paper
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

The automated data creation pipeline for constructing the GoT datasets using Qwen2-VL and Qwen2.5.

Main Takeaways

The paper constructs the first large-scale reasoning chain dataset for visual generation, comprising over 9 million samples across generation and editing tasks.
The unified framework successfully handles both text-to-image generation and complex image editing (single and multi-turn) within a single architecture.
Qualitative examples demonstrate that explicit spatial reasoning allows for precise object placement and complex manipulation (e.g., swapping objects while maintaining scene coherence) that baseline diffusion models struggle with.
The system supports interactive generation, allowing users to modify the intermediate reasoning chain (text or coordinates) to precisely control the final image output.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (Latent Diffusion/SDXL)
Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) Reasoning
Low-Rank Adaptation (LoRA)

Key Terms

GoT: Generation Chain-of-Thought—a paradigm where models output step-by-step reasoning and spatial plans before generating images

SSGM: Semantic-Spatial Guidance Module—a component that converts MLLM reasoning chains into embeddings and spatial masks to guide the diffusion model

MLLM: Multimodal Large Language Model—AI models capable of processing and generating both text and visual data

SDXL: Stable Diffusion XL—a large-scale latent diffusion model used as the generation backbone

VAE: Variational Autoencoder—used here to encode spatial masks and reference images into the latent space for diffusion guidance

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique used here to update the MLLM decoder