Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

📝 Paper Summary

Multimodal Chain-of-Thought (CoT), Visual Reasoning, Agentic AI

Cantor improves visual reasoning by injecting visual context early into the decision-making process and using a single Multimodal Large Language Model to role-play diverse expert modules.

Core Problem

Existing Multimodal CoT methods often make decisions using only text (ignoring visual context) and rely on low-level external tools (OCR, detectors) that lack high-level cognitive summarization capabilities.

Why it matters:

Text-only decision making leads to 'determining hallucinations,' where the model misinterprets ambiguous questions (e.g., 'this class') without seeing the image.
Low-level tools return raw data (coordinates, bounding boxes) rather than the abstract reasoning that complex tasks need, leaving the LLM to integrate long, low-level context on its own.

Concrete Example: A question asks 'What is the highest amount this class measures?' Without the image, a text-only planner guesses that 'class' refers to programming or physics. With Cantor, the planner sees the image of a beaker, recognizes that 'class' refers to the container, and directs a vision expert to read the volume markings.

Key Novelty

Perception-Decision Architecture with MLLM-as-Experts

Integrates visual information directly into the initial decision-generation stage to prevent context-free planning errors common in text-only planners.
Replaces fragmented external tools (APIs) with a single MLLM acting as distinct 'experts' (e.g., VisionIQ Analyst, ObjectQuant Locator) via prompted role-playing.
Does not require fine-tuning or ground-truth rationale annotations, operating purely through advanced prompting strategies.

Evaluation Highlights

+4.11 percentage-point accuracy gain on ScienceQA (81.22 to 85.33) and +5.9 points on MathVista (45.7 to 51.6) when using Gemini as the decision generator, compared to baselines.
+9.2% accuracy gain on MathVista when using GPT-3.5 as the decision generator compared to baselines.
Achieves State-of-the-Art (SOTA) results on both ScienceQA and MathVista benchmarks without fine-tuning.

Breakthrough Assessment

8/10

Significantly simplifies Multimodal CoT by unifying the planner and executor into MLLM interactions, removing the need for fragile external APIs while achieving SOTA results.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot or Few-shot Visual Reasoning Tasks

Inputs: Image I, Text T (problem statement + context), Prompt P_in

Outputs: Final Answer A

Pipeline Flow

Decision-Generation: Analyze I and T to generate plan P_out (strategies + sub-tasks)
Execute-Modularization: Assign sub-tasks to MLLM-based Experts (G) to get sub-answers S_a
Execute-Synthesis: Aggregate sub-tasks and sub-answers into context S to generate final answer A
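
The three stages above can be sketched as a small Python pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MLLM` callable, the `SubTask` type, the prompt wording, and the `parse_plan` helper are all hypothetical names introduced here, and whether the synthesis step also receives the image is an assumption (it is passed as `None` below).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical model interface: (prompt, optional image bytes) -> text reply.
MLLM = Callable[[str, Optional[bytes]], str]


@dataclass
class SubTask:
    expert: str  # expert persona the MLLM should role-play
    task: str    # instruction handed to that expert


def decision_generation(mllm: MLLM, image: bytes, text: str,
                        parse_plan: Callable[[str], List[SubTask]]) -> List[SubTask]:
    """Stage 1: the planner sees both image I and text T and produces plan P_out."""
    reply = mllm(f"Plan the sub-tasks needed to answer:\n{text}", image)
    return parse_plan(reply)


def execute_modularization(mllm: MLLM, image: bytes,
                           plan: List[SubTask]) -> List[str]:
    """Stage 2: one MLLM role-plays each expert and returns sub-answers S_a."""
    return [mllm(f"You are {st.expert}. {st.task}", image) for st in plan]


def execute_synthesis(mllm: MLLM, text: str, plan: List[SubTask],
                      sub_answers: List[str]) -> str:
    """Stage 3: fold sub-tasks and sub-answers into context S, then answer."""
    context = "\n".join(
        f"[{st.expert}] {st.task} -> {ans}" for st, ans in zip(plan, sub_answers)
    )
    # Assumption: the final call is text-only, so no image is passed.
    return mllm(f"Question: {text}\nEvidence:\n{context}\nFinal answer:", None)


def cantor_style_answer(mllm: MLLM, image: bytes, text: str,
                        parse_plan: Callable[[str], List[SubTask]]) -> str:
    """Run decision generation, expert execution, and synthesis in order."""
    plan = decision_generation(mllm, image, text, parse_plan)
    sub_answers = execute_modularization(mllm, image, plan)
    return execute_synthesis(mllm, text, plan, sub_answers)
```

Passing the model as a plain callable keeps the sketch independent of any particular API client, so the same flow can be exercised with a stub in place of Gemini or GPT-3.5.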

System Modules

Decision Generator

Analyze image and text to formulate a reasoning plan and assign tasks to experts

Model or implementation: Gemini or GPT-3.5

Expert Modules (Virtual)

Execute specific sub-tasks via role-playing prompts

Model or implementation: Gemini (acting as single MLLM for all experts)

Answer Generator

Synthesize all expert outputs and reasoning into a final answer

Model or implementation: Gemini or GPT-3.5

Novel Architectural Elements

Perception-Decision Architecture: Integrating visual perception directly into the planning/decision phase (rather than just execution)
Soft-Modularization: Using a single MLLM to role-play specific 'Expert' personas via prompts instead of calling distinct external software tools
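
Soft-modularization can be pictured as a prompt builder that swaps the persona while the underlying model stays the same. The sketch below is illustrative only: the two expert names appear in this summary, but the persona descriptions, the prompt layout, and the `build_expert_prompt` helper are assumptions made here, not the paper's templates.

```python
# Persona names come from the summary; the description text for each one is
# an assumption written for this example, not quoted from the paper.
EXPERT_PERSONAS = {
    "VisionIQ Analyst": "You answer questions about what the image shows.",
    "ObjectQuant Locator": "You locate objects in the image and compare their quantities.",
}


def build_expert_prompt(expert: str, sub_task: str) -> str:
    """Wrap a sub-task in a role-playing prompt for one predefined persona."""
    if expert not in EXPERT_PERSONAS:
        # Only personas defined up front can be assigned work.
        raise KeyError(f"unknown expert persona: {expert}")
    return (
        f"Role: {expert}. {EXPERT_PERSONAS[expert]}\n"
        f"Sub-task: {sub_task}\n"
        "Reply with a concise sub-answer."
    )


if __name__ == "__main__":
    print(build_expert_prompt("VisionIQ Analyst", "Read the top volume marking."))
```

Every prompt built this way goes to the same MLLM, so adding an expert means adding a dictionary entry rather than integrating another tool.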

Modeling

Base Model: Gemini (the Pro or Ultra variant is not specified; Pro is assumed here, not confirmed) and GPT-3.5

Compute: Not reported in the paper

Comparison to Prior Work

vs. DD-CoT: DD-CoT uses text-only decomposition leading to hallucinations; Cantor uses visual input during decomposition.
vs. MM-CoT: Cantor requires no fine-tuning or ground-truth rationales.
vs. Chameleon [not cited in paper]: Cantor uses a single MLLM as 'soft' tools/experts rather than calling disparate external APIs.

Limitations

Dependency on the capabilities of the underlying MLLM (e.g., Gemini); if the MLLM fails at a sub-task, the chain breaks.
Latency and cost associated with multiple MLLM calls per question (Decision + Expert Calls + Synthesis).
Limited to the predefined set of 'expert' personas defined in the prompt.
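
The call overhead noted above can be made concrete with a back-of-the-envelope count. The helper below is a hypothetical illustration that assumes exactly one MLLM call per expert sub-task; the summary gives no measured latency or cost figures.

```python
def mllm_calls_per_question(num_expert_subtasks: int) -> int:
    """One decision call, one call per expert sub-task, one synthesis call.

    Assumption: each expert sub-task costs exactly one MLLM call.
    """
    if num_expert_subtasks < 0:
        raise ValueError("num_expert_subtasks must be non-negative")
    return 1 + num_expert_subtasks + 1
```

Under that assumption, a question routed to three expert sub-tasks costs five calls, versus one call for direct prompting.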

Reproducibility

Code: https://ggg0919.github.io/cantor/

Prompt templates and the general framework are described. The link in the paper abstract points to a project page hosted on GitHub Pages. The method relies on closed-source APIs (Gemini, GPT-3.5).

📊 Experiments & Results

Evaluation Setup

Zero/Few-shot evaluation on visual reasoning benchmarks

Benchmarks:

ScienceQA (Multimodal science question answering)
MathVista (Visual mathematical reasoning)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ScienceQA	Accuracy	81.22	85.33	+4.11
MathVista	Accuracy	45.7	51.6	+5.9
MathVista	Accuracy	49.9	51.6	+1.7
MathVista	Accuracy	51.5	51.6	+0.1
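
The Δ column is the absolute difference in accuracy points (This Paper minus Baseline), which can be re-derived from the table. The short check below copies the four rows as listed; the `recompute_deltas` helper is a name introduced here for illustration.

```python
from typing import List, Tuple

# (benchmark, baseline accuracy, this paper's accuracy), copied from the table.
ROWS: List[Tuple[str, float, float]] = [
    ("ScienceQA", 81.22, 85.33),
    ("MathVista", 45.7, 51.6),
    ("MathVista", 49.9, 51.6),
    ("MathVista", 51.5, 51.6),
]


def recompute_deltas(rows: List[Tuple[str, float, float]]) -> List[float]:
    """Return the absolute gain in accuracy points, rounded to 2 decimals."""
    return [round(ours - baseline, 2) for _, baseline, ours in rows]


if __name__ == "__main__":
    for (name, baseline, ours), delta in zip(ROWS, recompute_deltas(ROWS)):
        print(f"{name}: {baseline} -> {ours} (+{delta} points)")
```

The gains are therefore percentage points, not relative improvements.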

Main Takeaways

Integrating visual context into the planning stage reduces the 'determining hallucinations' that arise when a planner works from text alone.
Using MLLMs as high-level experts (e.g., comparing quantities) is more effective than low-level tools (e.g., detecting bounding boxes) for reasoning tasks.
The framework generalizes across different backend models (Gemini, GPT-3.5) and benchmarks (ScienceQA, MathVista).

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLM)
Chain-of-Thought (CoT) prompting
In-context learning

Key Terms

MLLM: Multimodal Large Language Model—an AI model capable of processing and reasoning over both text and image inputs

CoT: Chain-of-Thought—a prompting strategy that encourages models to generate intermediate reasoning steps before the final answer

Hallucination: A phenomenon where an AI generates plausible but incorrect or nonsensical information not grounded in the input

SOTA: State-of-the-Art, the best-performing method currently known for a given task or benchmark

OCR: Optical Character Recognition—conversion of images of text into machine-encoded text