
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

Timin Gao, Peixian Chen, Mengdan Zhang, Chaoyou Fu, Yunhang Shen, Yan Zhang, Shengchuan Zhang, Xiawu Zheng, Xing Sun, Liujuan Cao, Rongrong Ji
Not explicitly reported in the paper
ACM Multimedia (2024)
MM Reasoning Agent

📝 Paper Summary

Multimodal Chain-of-Thought (CoT) · Visual Reasoning · Agentic AI
Cantor improves visual reasoning by injecting visual context early into the decision-making process and using a single Multimodal Large Language Model to role-play diverse expert modules.
Core Problem
Existing Multimodal CoT methods often make decisions using only text (ignoring visual context) and rely on low-level external tools (OCR, detectors) that lack high-level cognitive summarization capabilities.
Why it matters:
  • Text-only decision making leads to 'determining hallucinations,' where a planner that never sees the image misinterprets an ambiguous reference in the question (e.g., 'this class').
  • Low-level tools return raw data (coordinates, bounding boxes) rather than the abstract reasoning that complex tasks need, leaving the LLM to integrate long, unstructured outputs on its own.
Concrete Example: A question asks 'What is the highest amount this class measures?' Without the image, a text-only planner guesses that 'class' refers to programming or physics. With Cantor, the planner sees the image of a beaker, identifies that 'class' refers to the container, and directs a vision expert to read the volume markings.
Key Novelty
Perception-Decision Architecture with MLLM-as-Experts
  • Integrates visual information directly into the initial decision-generation stage to prevent context-free planning errors common in text-only planners.
  • Replaces fragmented external tools (APIs) with a single MLLM acting as distinct 'experts' (e.g., VisionIQ Analyst, ObjectQuant Locator) via prompted role-playing.
  • Requires neither fine-tuning nor ground-truth rationale annotations; it operates purely through prompting.
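The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MLLM` callable interface, the prompt wording, the 'Expert: instruction' reply format, the role descriptions, and the choice to run the final synthesis step without the image are all hypothetical. Only the expert names come from the summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical model interface (an assumption): (prompt, image or None) -> reply text.
MLLM = Callable[[str, Optional[bytes]], str]

# Expert names are taken from the summary; the role descriptions are assumptions.
EXPERT_ROLES = {
    "VisionIQ Analyst": "You analyze the image and answer high-level questions about it.",
    "ObjectQuant Locator": "You locate and count objects in the image.",
}


@dataclass
class SubTask:
    expert: str
    instruction: str


def generate_decision(mllm: MLLM, question: str, image: bytes) -> List[SubTask]:
    """Decision stage: the planner sees the image together with the question."""
    prompt = (
        "You are a planner. Look at the image and the question, then assign sub-tasks.\n"
        f"Available experts: {', '.join(EXPERT_ROLES)}\n"
        "Reply with one line per sub-task in the form 'Expert: instruction'.\n"
        f"Question: {question}"
    )
    reply = mllm(prompt, image)
    tasks = []
    for line in reply.splitlines():
        name, sep, instruction = line.partition(":")
        # Keep only lines that name a known expert; ignore anything else.
        if sep and name.strip() in EXPERT_ROLES:
            tasks.append(SubTask(name.strip(), instruction.strip()))
    return tasks


def execute(mllm: MLLM, question: str, image: bytes, tasks: List[SubTask]) -> str:
    """Execution stage: the same MLLM role-plays each expert, then synthesizes an answer."""
    findings = []
    for task in tasks:
        expert_prompt = f"{EXPERT_ROLES[task.expert]}\nTask: {task.instruction}"
        findings.append(f"{task.expert}: {mllm(expert_prompt, image)}")
    synthesis_prompt = (
        "Using the expert findings, answer the question.\n"
        + "\n".join(findings)
        + f"\nQuestion: {question}"
    )
    # Assumption: synthesis works from the experts' text findings only.
    return mllm(synthesis_prompt, None)


def fake_mllm(prompt: str, image: Optional[bytes]) -> str:
    """Canned stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("You are a planner"):
        return "VisionIQ Analyst: Read the highest volume marking on the container."
    if prompt.startswith("Using the expert findings"):
        return "canned final answer"
    return "canned expert finding"
```

With a real model behind the `MLLM` callable, `generate_decision` would be called first and its sub-tasks passed to `execute`. Because the planner prompt is sent together with the image, an ambiguous reference such as 'this class' can be resolved before any expert is assigned.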
Evaluation Highlights
  • +4.11% accuracy gain on ScienceQA and +5.9% on MathVista when using Gemini as the decision generator compared to baselines.
  • +9.2% accuracy gain on MathVista when using GPT-3.5 as the decision generator compared to baselines.
  • Achieves State-of-the-Art (SOTA) results on both ScienceQA and MathVista benchmarks without fine-tuning.
Breakthrough Assessment
8/10
Significantly simplifies Multimodal CoT by unifying the planner and executor into MLLM interactions, removing the need for fragile external APIs while achieving SOTA results.