
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang Dai, Jialiang Wang, Zecheng He, Jianwei Yang, Chunyuan Li, Junzhe Sun, Chu Wang, Serena Yeung-Levy, Felix Juefei-Xu
Stanford University, Meta Superintelligence Labs, Nanyang Technological University
arXiv (2026)
MM Reasoning Agent RL

📝 Paper Summary

Multimodal Large Language Models (MLLMs) / Test-Time Scaling (TTS)
UniT enables a single unified multimodal model to iteratively generate, verify, and refine visual content at test time. It does so by training on synthetic reasoning trajectories and by enforcing a compute budget at inference (budget forcing).
Core Problem
Current unified multimodal models operate in a single-pass mode, producing outputs without the ability to verify, reflect, or refine them, which limits performance on complex reasoning and compositional tasks.
Why it matters:
  • Tasks involving complex spatial compositions or multi-step editing require iterative self-correction, which single-pass models cannot perform.
  • Capabilities for generation, verification, and editing are currently scattered across specialized models rather than integrated into one system.
  • Test-time scaling (allocating more compute at inference) has improved text reasoning but remains unexplored for unified multimodal models.
Concrete Example: Given a complex prompt like 'a red cube on top of a blue cylinder next to a green sphere,' a single-pass model might omit objects or render them in the wrong colors. Without a mechanism to 'look' at its own output, recognize the error (verification), and plan a fix (subgoal decomposition), the model cannot correct itself.
Key Novelty
Unified Multimodal Chain-of-Thought Test-Time Scaling
  • Trains a single unified model on synthetic 'thought' data where vision-language models critique and edit images in a loop, internalizing the verify-refine process.
  • Uses 'budget forcing' at inference: if the model tries to stop early, the system forces it to continue reasoning ('Let's edit the image') until a compute budget is met.
  • Generalizes from short training chains to longer inference chains, showing that the model learns the *process* of refinement rather than just memorizing fixed-length patterns.
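The budget-forcing loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `Step`, `Trace`, `run_with_budget`, and `toy_step` are hypothetical names, the budget is counted in refinement rounds (an assumption), and the "image" is a plain string standing in for real visual content. Only the continuation cue 'Let's edit the image' comes from the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Continuation cue quoted in the summary; everything else here is hypothetical.
CONTINUE_CUE = "Let's edit the image"


@dataclass
class Step:
    thought: str      # the model's verification / planning text for this round
    image: str        # stand-in for the generated or edited image
    wants_stop: bool  # True if the model tried to end its reasoning here


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)
    forced: int = 0   # how many times an early stop was overridden


def run_with_budget(
    model_step: Callable[[str, str, str], Step],
    prompt: str,
    budget: int,
) -> Trace:
    """Run exactly `budget` rounds, overriding any early stop with the cue."""
    trace = Trace()
    image, cue = "", ""
    for round_idx in range(budget):
        step = model_step(prompt, image, cue)
        trace.steps.append(step)
        image, cue = step.image, ""
        is_last = round_idx == budget - 1
        if step.wants_stop and not is_last:
            # Budget not yet met: suppress the stop and push the model onward.
            cue = CONTINUE_CUE
            trace.forced += 1
    return trace


def toy_step(prompt: str, image: str, cue: str) -> Step:
    """Toy stand-in for a unified model: adds one missing object per round."""
    wanted = prompt.split(", ")
    have = [obj for obj in image.split(", ") if obj]
    missing = [obj for obj in wanted if obj not in have]
    if missing:
        have.append(missing[0])
    # The toy model always tries to stop, so every non-final round is forced.
    return Step(thought=f"missing: {missing}", image=", ".join(have), wants_stop=True)


if __name__ == "__main__":
    trace = run_with_budget(toy_step, "red cube, blue cylinder, green sphere", budget=3)
    print(trace.steps[-1].image, "| forced continuations:", trace.forced)
```

In this toy setup a budget of one round leaves objects missing, while a larger budget lets each forced round repair one more error, which mirrors the verify-then-edit behavior the bullets describe.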
Evaluation Highlights
  • +53.33% improvement on MIRA (out-of-distribution visual reasoning) by scaling from 1 to 10 rounds.
  • +225.19% improvement on ImgEdit multi-turn editing benchmarks by increasing refinement rounds.
  • Sequential chain-of-thought scaling matches the performance of parallel best-of-N sampling at 2.5x lower computational cost.
Breakthrough Assessment
9/10
Successfully transfers the test-time scaling paradigm (proven in text LLMs like o1) to multimodal unified models, showing massive gains in both generation and understanding with emergent generalization behaviors.