EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

📝 Paper Summary

Multimodal Large Language Models (MLLMs) · Diffusion Models · Chain-of-Thought Reasoning

EndoCoT enables diffusion models to perform multi-step chain-of-thought reasoning by iteratively refining latent states within the text encoder before guiding image generation, rather than relying on static single-step embeddings.

Core Problem

Current diffusion models use MLLMs as static text encoders that compute embeddings once, failing to activate chain-of-thought processes needed for complex multi-step reasoning tasks like mazes or spatial planning.

Why it matters:

Static guidance prevents the model from decomposing complex instructions into actionable steps, leading to catastrophic failure in tasks requiring logical constraints
Prior methods like DiffThinker inject reasoning externally, which yields only superficial alignment that breaks down on novel domains or complex topologies
Without endogenous reasoning, diffusion models merely perform pattern matching rather than genuine cognitive processing, limiting their use in logical tasks

Concrete Example: In a 32x32 maze generation task, a standard diffusion model generates a path that looks visually correct but passes through walls because the static text embedding cannot enforce the strict sequential constraints needed for a valid solution.
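The failure described above (a path that looks plausible but crosses walls) can be made concrete with a small validity check. This is only an illustrative sketch: the grid encoding (`#` for walls, `.` for open cells), the 4-connected movement rule, and the name `is_valid_path` are assumptions for this example, not the paper's evaluation code.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column)


def is_valid_path(grid: List[str], path: List[Cell], start: Cell, goal: Cell) -> bool:
    """Return True if `path` runs from start to goal through open cells only."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        # Reject cells that are out of bounds or inside a wall.
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
            return False
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        # Each step must move to a 4-connected neighbour.
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            return False
    return True


# Hypothetical 3x3 maze: '#' is a wall, '.' is open.
GRID = [
    "..#",
    ".##",
    "...",
]
good_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
bad_path = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]  # visually direct, but crosses walls
```

Both paths connect the same start and goal; only the first respects the wall constraints, which is the kind of strict check that a visually plausible output can fail.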

Key Novelty

Endogenous Chain-of-Thought (EndoCoT)

Iterative Thought Guidance: recursively updates the MLLM's latent hidden states multiple times to simulate a chain-of-thought process before generation
Terminal Thought Grounding: forces the final latent reasoning state to align with explicit textual supervision (the ground truth answer), anchoring the reasoning trajectory
Joint fine-tuning of both the MLLM and Diffusion Transformer (DiT) to synchronize the dynamic reasoning states with the denoising process

Architecture

The EndoCoT framework illustrating the iterative reasoning loop and progressive training stages.

Evaluation Highlights

Achieves 92.1% average accuracy across Maze, TSP, VSP, and Sudoku benchmarks, outperforming the strongest baseline by 8.3 percentage points
Maintains 90% accuracy on complex Maze-32 tasks where baselines fail, outperforming the strongest baseline by 25 percentage points
Achieves 95% accuracy on Sudoku-35, outperforming the strongest baseline by 40 percentage points

Breakthrough Assessment

8/10

Significant step forward in making diffusion models reason rather than just generate. The shift from static to dynamic/iterative conditioning addresses a fundamental bottleneck in spatial reasoning tasks.

⚙️ Technical Details

Problem Definition

Setting: Conditional visual generation requiring multi-step logical constraints (e.g., pathfinding, constraint satisfaction)

Inputs: Textual prompt and initial visual state (e.g., an empty maze image)

Outputs: Visual solution trajectory (e.g., the solved path through the maze)

Pipeline Flow

Iterative Thought Guidance (MLLM updates latent states T times)
Conditional Flow Generation (DiT generates intermediate visuals from each state)
Terminal Thought Grounding (Align final state with text supervision)

System Modules

Iterative Thought Guidance

Iteratively refines the latent thought state h_tau using the MLLM, bypassing the discrete embedding lookup for steps > 1

Model or implementation: Qwen-Image-Edit-2511 (text encoder part)
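The recursive update can be sketched as a loop in which each hidden state is fed back as the next input. This is a toy illustration under stated assumptions: `mllm_step` stands in for a forward pass of the MLLM over continuous states, and `toy_mllm_step` is a hypothetical contraction map chosen only so the example runs; neither is the paper's implementation.

```python
from typing import Callable, List

Vector = List[float]


def iterative_thought_guidance(
    h1: Vector,
    mllm_step: Callable[[Vector], Vector],
    num_steps: int,
) -> List[Vector]:
    """Return the thought states [h_1, ..., h_T], feeding each state back as input."""
    states = [h1]
    for _ in range(num_steps - 1):
        # Steps > 1 consume the previous hidden state directly,
        # with no discrete token embedding lookup in between.
        states.append(mllm_step(states[-1]))
    return states


def toy_mllm_step(h: Vector) -> Vector:
    # Hypothetical stand-in for the MLLM: moves halfway toward a fixed target.
    target = [1.0, -1.0]
    return [0.5 * x + 0.5 * t for x, t in zip(h, target)]


states = iterative_thought_guidance([0.0, 0.0], toy_mllm_step, num_steps=4)
```

In the summary's pipeline, each element of `states` would condition its own generation pass, and the last one is the state that Terminal Thought Grounding supervises.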

Conditional Flow Generation

Generates visual output conditioned on the current thought state h_tau using Flow Matching

Model or implementation: Diffusion Transformer (DiT)
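As a rough picture of conditional flow-based sampling, the sketch below integrates an ODE dz/dt = v(z, t, h) from t = 0 to t = 1 with Euler steps. The Euler integrator and the `toy_velocity` field are assumptions made for illustration; in the real system the velocity would come from the trained DiT, and the conditioning state h would not itself be the target.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    z0: Vector,
    velocity: Callable[[Vector, float, Vector], Vector],
    h: Vector,
    num_steps: int = 8,
) -> Vector:
    """Integrate dz/dt = velocity(z, t, h) from t=0 to t=1 with fixed Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(z, t, h)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def toy_velocity(z: Vector, t: float, h: Vector) -> Vector:
    # Hypothetical field: points from the current sample toward h, scaled so the
    # sample arrives at t = 1. Here h plays the role of the target purely for
    # illustration; a trained network would replace this function.
    return [(hi - zi) / (1.0 - t) for zi, hi in zip(z, h)]


sample = euler_sample([0.0, 0.0], toy_velocity, h=[2.0, -1.0])
```

Changing `h` changes where the trajectory ends, which is the sense in which the thought state steers generation in this toy setting.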

Terminal Thought Grounding

Aligns the final reasoning state h_T with the embedding of the ground-truth text answer

Model or implementation: L2 Loss mechanism (during training)

Novel Architectural Elements

Recursive latent state update loop where the MLLM's output hidden state feeds back as input for the next reasoning step
Multi-stage supervision where intermediate reasoning states condition intermediate denoising processes
Two-stage progressive training: (1) Reasoning Development (supervising all steps), (2) Terminal Consolidation (supervising only final output)
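One way to read the two-stage schedule is as a choice of which reasoning steps receive supervision. The helper below is a hypothetical sketch of that reading (the function name and 1-based step indexing are assumptions); the summary does not give the paper's actual training code.

```python
from typing import List


def supervised_steps(stage: int, num_steps: int) -> List[int]:
    """Return the 1-based reasoning steps that receive supervision in a stage."""
    if stage == 1:
        # Reasoning Development: every intermediate step is supervised.
        return list(range(1, num_steps + 1))
    if stage == 2:
        # Terminal Consolidation: only the final output is supervised.
        return [num_steps]
    raise ValueError("stage must be 1 or 2")


stage_one = supervised_steps(1, num_steps=4)
stage_two = supervised_steps(2, num_steps=4)
```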

Modeling

Base Model: Qwen-Image-Edit-2511

Training Method: Flow Matching with auxiliary semantic alignment loss

Objective Functions:

Purpose: Train the DiT to generate images from noise based on the current thought state.

Formally: Conditional Flow Matching loss L_FM^tau = E[ || v_theta(z_tau(t), t, h_tau) - u_t(z_tau(t)) ||^2 ]

Purpose: Align the final MLLM latent state with the ground-truth text embedding to prevent drift.

Formally: Semantic Alignment loss L_align = || h_T - P_gt[L+1] ||^2

Purpose: Combined objective balancing generation and grounding.

Formally: L = Sum_tau [ L_FM^tau + lambda_align * I_{tau=T} * L_align ], where the indicator I_{tau=T} applies the alignment term only at the final step tau = T
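The combined objective can be evaluated numerically as follows. This is a minimal sketch: vectors are plain lists, the expectation in the flow matching term is replaced by a single sample per step, and all function names are hypothetical. The alignment term is added once, for the final state only, matching the indicator in the formula.

```python
from typing import List

Vector = List[float]


def sq_l2(a: Vector, b: Vector) -> float:
    """Squared L2 distance ||a - b||^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flow_matching_loss(v_pred: Vector, u_target: Vector) -> float:
    # Single-sample estimate of the per-step flow matching loss.
    return sq_l2(v_pred, u_target)


def total_loss(
    fm_losses: List[float],
    h_final: Vector,
    p_gt: Vector,
    lambda_align: float = 1.0,
) -> float:
    # Sum the per-step flow matching losses, then add the alignment
    # term once for the final thought state h_T.
    return sum(fm_losses) + lambda_align * sq_l2(h_final, p_gt)


# Two reasoning steps with made-up predictions and targets.
fm = [
    flow_matching_loss([1.0, 0.0], [0.5, 0.0]),
    flow_matching_loss([0.0, 1.0], [0.0, 0.5]),
]
loss = total_loss(fm, h_final=[1.0, 2.0], p_gt=[1.0, 0.0], lambda_align=1.0)
```

With these made-up numbers the two flow matching terms contribute 0.25 each and the alignment term contributes 4.0, so `loss` is 4.5.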

Adaptation: LoRA (Low-Rank Adaptation) applied to both MLLM and DiT components
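For readers unfamiliar with LoRA, the sketch below shows the standard formulation: the frozen weight W is left untouched and a scaled low-rank product B·A is added to its output. It uses plain lists rather than a tensor library, and the ranks, scaling, and target layers used in the paper are not stated in this summary, so the values here are purely illustrative.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float, r: int) -> Vector:
    """Compute W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    base = matvec(w, x)              # frozen pre-trained path
    low_rank = matvec(b, matvec(a, x))  # rank-r update path
    scale = alpha / r
    return [y + scale * d for y, d in zip(base, low_rank)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight, d_out x d_in
A = [[1.0, 1.0]]              # r x d_in, with r = 1
B = [[1.0], [2.0]]            # d_out x r
x = [1.0, 2.0]

adapted = lora_forward(W, A, B, x, alpha=2.0, r=1)
# With B set to zero (a common initialisation), the adapter is a no-op.
untouched = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0, r=1)
```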

Key Hyperparameters:

lambda_align: 1

Compute: Not reported in the paper

Comparison to Prior Work

vs. DiffThinker: EndoCoT updates the conditioning embedding iteratively during generation (endogenous), whereas DiffThinker computes it once (static)
vs. Qwen3-VL-8B: EndoCoT integrates the reasoning loop directly into the diffusion process via latent states, rather than just using the VL model as a fixed encoder
vs. Latent Sketchpad: EndoCoT performs reasoning in continuous latent space without decoding intermediate text tokens, enabling denser information flow [not cited in paper]

Limitations

Relies on sequential ground-truth decomposition for intermediate supervision (e.g., partial maze paths), which may not be available for all tasks
Inference cost scales linearly with the number of reasoning steps T
Requires joint fine-tuning of both MLLM and DiT, which is computationally more intensive than training DiT alone

Reproducibility

No code release is reported. The model uses Qwen-Image-Edit-2511 as its base. The evaluation benchmarks (Maze, TSP, VSP, Sudoku) are standard, but exact replication would require specific dataset construction details.

📊 Experiments & Results

Evaluation Setup

Visual reasoning tasks requiring strict constraint satisfaction

Benchmarks:

Maze (pathfinding; 8x8, 16x16, 32x32)
TSP (Traveling Salesperson Problem; optimization)
VSP (Visual Spatial Planning)
Sudoku (Constraint satisfaction puzzle)

Metrics:

Accuracy (success rate of generated solutions)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (pp)
Average across all tasks	Accuracy (%)	83.8	92.1	+8.3
Maze-32	Accuracy (%)	65.0	90.0	+25.0
Sudoku-35	Accuracy (%)	55.0	95.0	+40.0
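The Δ column is an absolute difference in accuracy (percentage points), which is not the same as a relative improvement. The short check below recomputes both from the table values; the helper name `gains` is just for this example.

```python
from typing import Dict, Tuple

# (baseline accuracy, this paper's accuracy), copied from the table above.
results: Dict[str, Tuple[float, float]] = {
    "Average across all tasks": (83.8, 92.1),
    "Maze-32": (65.0, 90.0),
    "Sudoku-35": (55.0, 95.0),
}


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(ours - baseline, 1)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative


summary = {name: gains(b, o) for name, (b, o) in results.items()}
```

For Maze-32, for instance, the absolute gain is 25.0 points while the relative gain over the 65.0 baseline is about 38.5%.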

Experiment Figures

Analysis of reasoning bottlenecks: (a) Layer-wise sensitivity, (b) Single-step failure cases, (c) Attention entropy maps.

Performance comparison and qualitative visualization of reasoning trajectories.

Main Takeaways

EndoCoT consistently outperforms baselines across all reasoning benchmarks (Maze, TSP, VSP, Sudoku).
The performance gap widens significantly as task complexity increases (e.g., larger Mazes or Sudoku grids), showing the benefit of iterative reasoning.
Visual analysis confirms EndoCoT produces valid step-by-step reasoning chains, whereas baselines commit to a solution early and fail to correct errors.
Ablation studies confirm the necessity of both iterative thought guidance and terminal thought grounding components.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (Flow Matching)
Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) prompting

Key Terms

EndoCoT: Endogenous Chain-of-Thought—the proposed framework for performing iterative reasoning within the diffusion model's latent space

DiT: Diffusion Transformer—a diffusion model architecture based on transformers rather than U-Nets

MLLM: Multimodal Large Language Model, a language model extended to process additional modalities such as images alongside text

Flow Matching: A generative modeling framework that learns a vector field to transform a simple noise distribution into the data distribution via an ODE

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

TSP: Traveling Salesperson Problem, a classic optimization problem of finding the shortest route that visits every city in a set exactly once and returns to the start

latent state: The internal hidden representation within the neural network (specifically the MLLM here) that encodes semantic information

VSP: Visual Spatial Planning—tasks requiring reasoning about spatial relationships and paths