ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

📝 Paper Summary

Text-to-Image In-Context Learning (T2I-ICL) · Unified Multimodal LLMs (MLLMs) · Chain-of-Thought Reasoning

ImageGen-CoT improves Text-to-Image In-Context Learning by teaching MLLMs to generate explicit reasoning steps before image synthesis, supported by a novel dataset and hybrid test-time scaling.

Core Problem

Unified Multimodal LLMs struggle to infer implicit patterns from interleaved text-image examples in T2I-ICL tasks, often failing to grasp contextual relationships or preserve compositional consistency.

Why it matters:

Current models fail to replicate human-like reasoning where concepts are learned from context (e.g., seeing 'leather book' -> 'leather apple' implies 'leather' style for new objects)
Standard fine-tuning methods for subject customization are resource-intensive and lack rapid generalization capabilities
Existing unified MLLMs produce disorganized thought processes when prompted zero-shot, leading to suboptimal image generation

Concrete Example: Given context 'a leather-bound book' then 'a leather apple', when asked for 'a box', a standard model might generate a generic box. A human infers the 'leather' pattern to imagine 'a leather box'. Current MLLMs fail to make this implicit style transfer.

Key Novelty

ImageGen-CoT (Image Generation Chain-of-Thought)

Introduce a dedicated reasoning step ('thought process') prior to image generation where the model explicitly articulates the style, subject, or relationship inferred from context
Construct a high-quality dataset via an automated pipeline in which an MLLM acts as Generator, Selector, Critic, and Refiner to produce vetted reasoning-image pairs
Deploy a 'Hybrid Scaling' strategy at inference time that first generates multiple reasoning chains (reasoning diversity) and then samples multiple images per chain (generation diversity)
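
The hybrid scaling idea can be read as a nested loop: sample several reasoning chains, sample several images per chain, and keep the best-scoring candidate. The Python sketch below is a minimal illustration under that reading. The callables `generate_cot`, `generate_image`, and `score`, and the choice of "keep the maximum score" as the selection rule, are hypothetical stand-ins and are not taken from the paper's code.

```python
import random
from typing import Callable, List, Tuple

Candidate = Tuple[str, str, float]  # (reasoning chain, image handle, score)


def hybrid_scaling(
    context: str,
    generate_cot: Callable[[str, random.Random], str],
    generate_image: Callable[[str, str, random.Random], str],
    score: Callable[[str], float],
    num_chains: int = 3,
    images_per_chain: int = 2,
    seed: int = 0,
) -> Candidate:
    """Return the (chain, image, score) triple with the highest score."""
    rng = random.Random(seed)
    candidates: List[Candidate] = []
    for _ in range(num_chains):              # reasoning diversity
        chain = generate_cot(context, rng)
        for _ in range(images_per_chain):    # generation diversity
            image = generate_image(context, chain, rng)
            candidates.append((chain, image, score(image)))
    return max(candidates, key=lambda c: c[2])


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_cot(context: str, rng: random.Random) -> str:
    return rng.choice(["style=leather", "style=wooden"])


def toy_image(context: str, chain: str, rng: random.Random) -> str:
    return f"{chain}|box|noise={rng.randint(0, 9)}"


def toy_score(image: str) -> float:
    return 1.0 if "leather" in image else 0.0


best_chain, best_image, best_score = hybrid_scaling(
    "leather book, leather apple -> box", toy_cot, toy_image, toy_score
)
print(best_chain, best_image, best_score)
```

With `num_chains=1` the loop reduces to sampling several images from a single chain, and with `images_per_chain=1` it reduces to sampling several chains with one image each, which are the two single-dimension settings the hybrid strategy combines.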

Architecture

The automated data construction pipeline and the hybrid scaling strategy.

Evaluation Highlights

SEED-X fine-tuned with ImageGen-CoT improves by 89% on CoBSAT and 114% on DreamBench++ benchmarks relative to base SEED-X
Achieves a score of 0.909 on CoBSAT (up from 0.349 baseline) using the proposed hybrid scaling strategy
Fine-tuning with the curated dataset outperforms simple prompting strategies, with SEED-X achieving 0.402 on DreamBench++ (vs 0.347 with prompting and 0.188 baseline); hybrid scaling raises this further to 0.543
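
The 89% and 114% figures are relative gains of the fine-tuned SEED-X scores reported in the Key Results table (0.660 on CoBSAT, 0.402 on DreamBench++) over the base scores. A short Python check of that arithmetic follows; the helper name `relative_gain` is just for illustration.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, as a percentage."""
    return (new - baseline) / baseline * 100.0


cobsat = relative_gain(0.349, 0.660)
dreambench = relative_gain(0.188, 0.402)
print(f"CoBSAT: {cobsat:.0f}%")            # CoBSAT: 89%
print(f"DreamBench++: {dreambench:.0f}%")  # DreamBench++: 114%
```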

Breakthrough Assessment

8/10

Significant performance jumps (89-114% relative to base SEED-X) on established benchmarks. Successfully adapts NLP's Chain-of-Thought and test-time scaling paradigms to the multimodal generation domain.

⚙️ Technical Details

Problem Definition

Setting: Text-to-Image In-Context Learning (T2I-ICL) where the model receives interleaved text-image demonstrations and a query text

Inputs: Sequence of interleaved images and text I_1, T_1, ..., I_k, T_k followed by query text T_query

Outputs: A generated image I_target that reflects the implicit pattern or subject defined in the context

Pipeline Flow

Step 1: Reasoning Generation (Model generates ImageGen-CoT text)
Step 2: Image Synthesis (Model takes Original Input + ImageGen-CoT + <image> token to generate visual tokens/embeddings)
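
The two steps above can be sketched as two sequential calls to the same model. In the minimal Python sketch below, `generate_text` and `generate_image` are hypothetical callables standing in for the unified MLLM's text and image decoding, and joining the pieces with newlines is an assumption; the source only states that the second stage receives the original input, the ImageGen-CoT, and the `<image>` token.

```python
from typing import Callable, Tuple

IMAGE_TOKEN = "<image>"  # trigger token named in the pipeline description


def two_stage_generate(
    prompt: str,
    generate_text: Callable[[str], str],
    generate_image: Callable[[str], str],
) -> Tuple[str, str]:
    """Stage 1 produces the reasoning text; stage 2 conditions on it."""
    # Step 1: the model writes its ImageGen-CoT for the interleaved prompt.
    cot = generate_text(prompt)
    # Step 2: original input + reasoning + <image> token drive image synthesis.
    # (Newline-joined concatenation is an assumed format.)
    stage2_input = f"{prompt}\n{cot}\n{IMAGE_TOKEN}"
    image = generate_image(stage2_input)
    return cot, image


# Toy stand-ins for illustration only.
cot, image = two_stage_generate(
    "leather book; leather apple; query: box",
    generate_text=lambda p: "The shared attribute is a leather texture.",
    generate_image=lambda s: f"image conditioned on {len(s)} characters",
)
print(cot)
print(image)
```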

System Modules

Reasoning Generator

Generate the explicit thought process (ImageGen-CoT) explaining the context

Model or implementation: Unified MLLM (SEED-X or SEED-LLaMA)

Image Generator

Generate the final image conditioned on the reasoning

Model or implementation: Unified MLLM (SEED-X or SEED-LLaMA)

Novel Architectural Elements

Two-stage inference protocol: Explicitly separates reasoning text generation from image token generation so that the model cannot skip the reasoning step and jump straight to image generation

Modeling

Base Model: SEED-LLaMA (discrete tokens) and SEED-X (continuous embeddings)

Training Method: Supervised Fine-Tuning (SFT) on curated ImageGen-CoT dataset

Objective Functions:

Purpose: Train the model to generate reasoning text.

Formally: Standard Language Modeling loss (lm_loss) on text tokens.

Purpose: Train the model to generate image tokens/embeddings.

Formally: MSE loss for continuous embeddings (SEED-X) or LM loss for discrete tokens (SEED-LLaMA).
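
Written out, the objectives might take the following form. The notation is an assumption for illustration, since the source names only the loss types: x is the interleaved input, c = (c_1, ..., c_T) the reasoning tokens, z_1, ..., z_K the target visual embeddings, v_1, ..., v_K the target discrete visual tokens, and \hat{z}_{\theta,k} the model's predicted embedding.

```latex
% Text objective: next-token cross-entropy over the reasoning tokens.
\mathcal{L}_{\text{text}}(\theta)
  = -\sum_{t=1}^{T} \log p_\theta\left(c_t \mid x, c_{<t}\right)

% Image objective, continuous embeddings (SEED-X): mean squared error.
\mathcal{L}_{\text{img}}^{\text{MSE}}(\theta)
  = \frac{1}{K} \sum_{k=1}^{K}
    \left\lVert \hat{z}_{\theta,k}(x, c) - z_k \right\rVert_2^2

% Image objective, discrete tokens (SEED-LLaMA): cross-entropy over visual tokens.
\mathcal{L}_{\text{img}}^{\text{LM}}(\theta)
  = -\sum_{k=1}^{K} \log p_\theta\left(v_k \mid x, c, v_{<k}\right)
```

Both image objectives condition on the reasoning c, matching the pipeline in which image synthesis receives the original input together with the ImageGen-CoT.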

Training Data:

Automated pipeline: MLLM generates N reasoning+image pairs
Selector picks best image
Critic provides feedback if quality is low
Refiner improves prompt iteratively
Dataset splits: Split 1 for text generation (reasoning), Split 2 for image generation
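
The Generator, Selector, Critic, and Refiner roles can be sketched as a short loop. The Python below is a minimal sketch, not the authors' pipeline: the four callables are hypothetical stand-ins for MLLM calls, the convention that the critic returns `None` to accept a pair is an assumption, and mapping `max_iterations` to the listed `max_iterations_data_gen: 2` is a guess about how that hyperparameter is used.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (reasoning text, image handle)


def build_sample(
    prompt: str,
    generator: Callable[[str], List[Pair]],
    selector: Callable[[List[Pair]], Pair],
    critic: Callable[[Pair], Optional[str]],
    refiner: Callable[[str, str], str],
    max_iterations: int = 2,
) -> Pair:
    """Generate candidates, pick the best, and refine while the critic objects."""
    best = selector(generator(prompt))
    for _ in range(max_iterations):
        feedback = critic(best)              # None means the pair is accepted
        if feedback is None:
            break
        prompt = refiner(prompt, feedback)   # fold the critique into the prompt
        best = selector(generator(prompt))
    return best  # the last selection is returned even if never accepted


# Toy stand-ins for illustration only.
def toy_generator(prompt: str) -> List[Pair]:
    return [(f"reasoning for: {prompt}", f"image<{prompt}>#{i}") for i in range(3)]


def toy_selector(pairs: List[Pair]) -> Pair:
    return pairs[0]


def toy_critic(pair: Pair) -> Optional[str]:
    return None if "leather" in pair[1] else "mention the leather style"


def toy_refiner(prompt: str, feedback: str) -> str:
    return f"{prompt} ({feedback})"


sample = build_sample("a box", toy_generator, toy_selector, toy_critic, toy_refiner)
print(sample)
```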

Key Hyperparameters:

sampling_temperature: 0.7
top_p: 0.8
max_iterations_data_gen: 2
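
The source lists these values without saying which stage they apply to. For readers unfamiliar with the two sampling parameters, the Python sketch below shows what temperature scaling and top-p (nucleus) truncation do to a next-token distribution in the standard formulation. The function name and the toy logits are illustrative assumptions, not part of the paper.

```python
import math
from typing import Dict


def nucleus_distribution(
    logits: Dict[str, float], temperature: float = 0.7, top_p: float = 0.8
) -> Dict[str, float]:
    """Apply temperature, keep the smallest top set with mass >= top_p, renormalize."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    kept: Dict[str, float] = {}
    cumulative = 0.0
    # Walk tokens from most to least likely until the nucleus holds top_p mass.
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    mass = sum(kept.values())
    return {tok: p / mass for tok, p in kept.items()}


dist = nucleus_distribution({"leather": 2.0, "wooden": 1.0, "metal": 0.0})
print(sorted(dist))  # tokens that survive truncation
```

A temperature below 1 sharpens the distribution before truncation, so fewer tokens are needed to reach the `top_p` mass.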

Compute: Not reported in the paper

Comparison to Prior Work

vs. Emu2: Explicitly generates reasoning steps before image generation rather than implicit binding
vs. Best-of-N: Proposes 'Hybrid Scaling' (Multiple Chains x Multiple Images) rather than just multiple images or multiple seeds
vs. Standard T2I-ICL: Uses a fine-tuned reasoning process rather than relying on zero-shot in-context learning capabilities

Limitations

SEED-LLaMA performance on DreamBench++ remains unchanged with prompting alone, due to limited base comprehension
Two-stage inference protocol increases latency compared to single-pass generation
Relies on the quality of the automated dataset construction pipeline

Reproducibility

Code: https://ImageGen-CoT.github.io/

Code and model weights will be open-sourced at https://ImageGen-CoT.github.io/. Data construction uses InternVL2.5-78B and FLUX.1-schnell.

📊 Experiments & Results

Evaluation Setup

T2I-ICL tasks requiring pattern inference from interleaved examples

Benchmarks:

CoBSAT: Text-to-Image In-Context Learning (10 tasks including style transfer, counting, etc.)
DreamBench++: Subject-driven generation (customization)

Metrics:

Average Score (CoBSAT)
DINO Score / CLIP Score (DreamBench++; inferred, not stated explicitly)
Pass@N (for scaling)
Statistical methodology: Not explicitly reported in the paper

Key Results

Setting	Benchmark	Metric	Baseline	This Paper	Δ
Prompting	CoBSAT	Average Score	0.349	0.439	+0.090
Prompting	DreamBench++	Average Score	0.188	0.347	+0.159
Fine-tuning	CoBSAT	Average Score	0.349	0.660	+0.311
Fine-tuning	DreamBench++	Average Score	0.188	0.402	+0.214
Hybrid scaling	CoBSAT	Average Score	0.660	0.909	+0.249
Hybrid scaling	DreamBench++	Average Score	0.402	0.543	+0.141

Prompting: Prompting with ImageGen-CoT provides immediate gains over baselines, though some models struggle with complex tasks without fine-tuning.
Fine-tuning: Fine-tuning with the curated ImageGen-CoT dataset yields substantial performance improvements, surpassing prompting methods.
Hybrid scaling: Test-time scaling (Hybrid strategy) pushes performance even further by leveraging both reasoning and generation diversity; its baseline values are the fine-tuned scores.

Experiment Figures

Bar chart comparing SEED-X Base, SEED-X + ImageGen-CoT (Prompt), SEED-X + ImageGen-CoT (Fine-Tuned), and SEED-X + Scaling on CoBSAT and DreamBench++.

Main Takeaways

Explicit reasoning (ImageGen-CoT) is crucial for T2I-ICL: Forcing the model to verbalize the pattern before generating the image significantly improves consistency.
Automated dataset curation works: Using a Generator-Selector-Critic-Refiner pipeline creates high-quality training data that enables effective fine-tuning.
Hybrid scaling outperforms single-dimension scaling: Expanding both the reasoning chains (Multi-Chain) and image variations (Single-Chain) simultaneously yields the best results, suggesting comprehension and generation are distinct bottlenecks.

📚 Prerequisite Knowledge

Prerequisites

In-Context Learning (ICL)
Unified Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) prompting
Diffusion Models for image generation

Key Terms

ImageGen-CoT: A structured reasoning text generated by the model before the image, explaining the inferred pattern or subject characteristics

T2I-ICL: Text-to-Image In-Context Learning, i.e., generating images based on patterns learned from a few examples provided in the prompt

Unified MLLMs: Models capable of processing and generating both text and images within a single architecture (e.g., SEED-X, SEED-LLaMA)

Test-time scaling: Increasing computational budget during inference (e.g., by generating more samples) to improve performance

Pass@N: A metric evaluating if at least one correct output exists among N generated samples
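
Under the definition above, Pass@N can be computed as the fraction of prompts for which at least one of the first N samples is correct. The Python sketch below assumes a binary correct/incorrect judgment per sample; how the paper decides correctness is not specified here, and the function name and toy data are illustrative.

```python
from typing import Sequence


def pass_at_n(results: Sequence[Sequence[bool]], n: int) -> float:
    """Fraction of prompts with at least one correct sample among the first n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not results:
        raise ValueError("results must contain at least one prompt")
    hits = [any(samples[:n]) for samples in results]
    return sum(hits) / len(hits)


# One row per prompt, one boolean per generated sample (toy data).
outcomes = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
print(pass_at_n(outcomes, 1))  # 0.25
print(pass_at_n(outcomes, 3))  # 0.75
```

Because the metric only asks whether any sample succeeds, it cannot decrease as N grows, which is why it is a natural way to report test-time scaling.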

DreamBench++: A benchmark for subject-driven image generation, evaluating the ability to generate images of a specific subject in different contexts

CoBSAT: A benchmark for T2I-ICL evaluating capabilities like style transfer, subject binding, and identifying relationships