Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang
Shanghai AI Laboratory,
Xi’an Jiaotong University,
Shanghai Jiao Tong University
arXiv (2026)
MMReasoningAgent
📝 Paper Summary
Multimodal Large Language Models (MLLMs), Diffusion Models, Chain-of-Thought Reasoning
EndoCoT enables diffusion models to perform multi-step chain-of-thought reasoning by iteratively refining latent states within the text encoder before guiding image generation, rather than relying on static single-step embeddings.
Core Problem
Current diffusion models use MLLMs as static text encoders that compute embeddings once, failing to activate chain-of-thought processes needed for complex multi-step reasoning tasks like mazes or spatial planning.
Why it matters:
Static guidance prevents the model from decomposing complex instructions into actionable steps, leading to catastrophic failure in tasks requiring logical constraints
Prior methods like DiffThinker inject reasoning externally but result in superficial alignment that breaks down on novel domains or complex topologies
Without endogenous reasoning, diffusion models merely perform pattern matching rather than genuine cognitive processing, limiting their use in logical tasks
Concrete Example: In a 32x32 maze generation task, a standard diffusion model generates a path that looks visually correct but passes through walls, because the static text embedding cannot enforce the strict sequential constraints needed for a valid solution.
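To make "looks visually correct but passes through walls" concrete, here is a minimal sketch of a path validity check. The grid encoding ('#' for walls, '.' for open cells), the function name `is_valid_path`, and the 3x3 toy maze are assumptions made for illustration; the summary does not describe how the benchmark scores solutions.

```python
from typing import List, Tuple

Cell = Tuple[int, int]


def is_valid_path(grid: List[str], path: List[Cell], start: Cell, goal: Cell) -> bool:
    """Return True if `path` runs from start to goal through open cells only.

    `grid` uses '#' for walls and '.' for open cells (hypothetical encoding).
    """
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        # Every visited cell must lie inside the grid and must not be a wall.
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
            return False
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        # Consecutive cells must be 4-neighbours: no jumps, no diagonals.
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            return False
    return True


maze = [
    "..#",
    ".##",
    "...",
]
good = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
bad = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]  # steps onto the wall at (1, 1)

good_ok = is_valid_path(maze, good, (0, 0), (2, 2))
bad_ok = is_valid_path(maze, bad, (0, 0), (2, 2))
```

Both paths connect the same endpoints and have the same length, so they can look equally plausible when drawn; only the cell-by-cell check separates them.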
Key Novelty
Endogenous Chain-of-Thought (EndoCoT)
Iterative Thought Guidance: recursively updates the MLLM's latent hidden states multiple times to simulate a chain-of-thought process before generation
Terminal Thought Grounding: forces the final latent reasoning state to align with explicit textual supervision (the ground truth answer), anchoring the reasoning trajectory
Joint fine-tuning of both the MLLM and Diffusion Transformer (DiT) to synchronize the dynamic reasoning states with the denoising process
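The control flow of the first two components can be sketched in a few lines. This is a toy illustration only: `toy_refine` stands in for the MLLM update (which in the paper is a learned model, not a closed-form rule), the mean squared error is an assumed form for the alignment objective, and every name below is hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def iterate_thought(h0: Vector, refine: Callable[[Vector], Vector], num_steps: int) -> List[Vector]:
    """Apply `refine` recursively and return the trajectory [h_0, ..., h_T]."""
    trajectory = [h0]
    for _ in range(num_steps):
        trajectory.append(refine(trajectory[-1]))
    return trajectory


def grounding_loss(h_final: Vector, target: Vector) -> float:
    """Mean squared error between the terminal state and a target embedding."""
    return sum((a - b) ** 2 for a, b in zip(h_final, target)) / len(target)


# Toy stand-in for the MLLM update: move halfway toward a fixed "answer" embedding.
answer = [1.0, -1.0]


def toy_refine(h: Vector) -> Vector:
    return [x + 0.5 * (a - x) for x, a in zip(h, answer)]


states = iterate_thought([0.0, 0.0], toy_refine, num_steps=4)
loss_static = grounding_loss(states[0], answer)    # single-pass embedding, no refinement
loss_refined = grounding_loss(states[-1], answer)  # terminal state after 4 refinement steps
```

The point of the sketch is the shape of the loop: the state is fed back into the same update T times, and the grounding term is evaluated only on the terminal state, which is then what conditions the generator.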
Architecture
The EndoCoT framework illustrating the iterative reasoning loop and progressive training stages.
Evaluation Highlights
Achieves 92.1% average accuracy across Maze, TSP, VSP, and Sudoku benchmarks, outperforming the strongest baseline by 8.3 percentage points
Maintains 90% accuracy on complex Maze-32 tasks where baselines fail, outperforming the strongest baseline by 25%
Achieves 95% accuracy on Sudoku-35, outperforming the strongest baseline by 40%
Breakthrough Assessment
8/10
Significant step forward in making diffusion models reason rather than just generate. The shift from static to dynamic/iterative conditioning addresses a fundamental bottleneck in spatial reasoning tasks.
Adaptation: LoRA (Low-Rank Adaptation) applied to both MLLM and DiT components
Key Hyperparameters:
lambda_align: 1
Compute: Not reported in the paper
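For readers unfamiliar with LoRA, the following sketch shows the standard low-rank update W + (alpha / r) * B A on a tiny matrix. The rank, the scaling factor alpha, and the 3x3 shapes are illustrative assumptions; the summary does not report the LoRA settings used for the MLLM or the DiT.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A; W stays frozen, only A and B are trained."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weight
b = [[0.0], [0.0], [0.0]]  # 3x1, zero-initialised so training starts exactly at W
a = [[0.1, 0.2, 0.3]]      # 1x3
w_eff = lora_effective_weight(w, b, a, alpha=1.0, rank=1)

# Trainable values: 3 (in B) + 3 (in A) = 6, versus 9 in the full matrix.
trainable = len(b) * len(b[0]) + len(a) * len(a[0])
```

The saving is negligible at this size but grows with the layer width, which is why adapting both the MLLM and the DiT with LoRA is cheaper than full fine-tuning of either.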
Comparison to Prior Work
vs. DiffThinker: EndoCoT updates the conditioning embedding iteratively during generation (endogenous), whereas DiffThinker computes it once (static)
vs. Qwen3-VL-8B: EndoCoT integrates the reasoning loop directly into the diffusion process via latent states, rather than just using the VL model as a fixed encoder
vs. Latent Sketchpad: EndoCoT performs reasoning in continuous latent space without decoding intermediate text tokens, enabling denser information flow [not cited in paper]
Limitations
Relies on sequential ground-truth decomposition for intermediate supervision (e.g., partial maze paths), which may not be available for all tasks
Inference cost scales linearly with the number of reasoning steps T
Requires joint fine-tuning of both MLLM and DiT, which is computationally more intensive than training DiT alone
Reproducibility
Code availability is not stated. The model uses Qwen-Image-Edit-2511 as its base. The evaluation benchmarks (Maze, TSP, VSP, Sudoku) are standard, but exact replication would require the specific dataset construction details.
Performance comparison and qualitative visualization of reasoning trajectories.
Main Takeaways
EndoCoT consistently outperforms baselines across all reasoning benchmarks (Maze, TSP, VSP, Sudoku).
The performance gap widens significantly as task complexity increases (e.g., larger Mazes or Sudoku grids), showing the benefit of iterative reasoning.
Visual analysis confirms EndoCoT produces valid step-by-step reasoning chains, whereas baselines commit to a solution early and fail to correct errors.
Ablation studies confirm the necessity of both iterative thought guidance and terminal thought grounding components.
📚 Prerequisite Knowledge
Prerequisites
Diffusion Models (Flow Matching)
Multimodal Large Language Models (MLLMs)
Chain-of-Thought (CoT) prompting
Key Terms
EndoCoT: Endogenous Chain-of-Thought—the proposed framework for performing iterative reasoning within the diffusion model's latent space
DiT: Diffusion Transformer—a diffusion model architecture based on transformers rather than U-Nets
MLLM: Multimodal Large Language Model—a model capable of processing and generating both text and images
Flow Matching: A generative modeling framework that learns a vector field to transform a simple noise distribution into the data distribution via an ODE
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
TSP: Traveling Salesperson Problem—a classic optimization problem requiring finding the shortest route visiting a set of cities
latent state: The internal hidden representation within the neural network (specifically the MLLM here) that encodes semantic information
VSP: Visual Spatial Planning—tasks requiring reasoning about spatial relationships and paths
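As a small illustration of the Flow Matching entry above, the sketch below integrates an ODE from a "noise" point to a "data" point with fixed-step Euler updates. It assumes the straight-line path x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0; in a trained model that velocity would be predicted by the network (here, the DiT) rather than computed from a known target. All names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(x0: Vector, velocity: Callable[[Vector, float], Vector], num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 0.0]   # starting sample from the simple distribution
data = [2.0, -1.0]   # toy "data" point the flow should reach


def toy_velocity(x: Vector, t: float) -> Vector:
    # Straight-line path: the velocity is constant and equal to data - noise.
    return [d - n for d, n in zip(data, noise)]


sample = euler_sample(noise, toy_velocity, num_steps=4)
```

Because the toy velocity field is constant, the Euler integration reaches the target regardless of the step count; a learned, state-dependent field generally needs more steps for the same accuracy.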