
Grounded Chain-of-Thought for Multimodal Large Language Models

Qiong Wu, Xiang Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
arXiv.org (2025)
MM Factuality Reasoning Benchmark

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Visual Hallucination, Chain-of-Thought Reasoning
Grounded Chain-of-Thought (GCoT) mitigates visual hallucination in MLLMs by requiring models to explicitly ground reasoning steps with bounding boxes before predicting an answer.
Core Problem
MLLMs often reach correct answers through language bias rather than genuine visual understanding, which is a form of visual hallucination, and existing benchmarks do not penalize correct answers derived from irrelevant visual regions.
Why it matters:
  • Current evaluation metrics (like POPE) can be gamed by models that guess correctly without actually attending to the image, which masks reliability issues.
  • Visual hallucination and language bias persist even in state-of-the-art models like GPT-4V, creating risks in safety-critical applications.
  • Scaling up model size does not automatically solve grounding consistency; larger models often hallucinate more by relying on stronger language priors.
Concrete Example: An MLLM might correctly answer 'What is the person doing?' with 'playing frisbee' based on context alone, but when asked to point to the frisbee, it highlights a random patch of grass, indicating that the answer was not grounded in the object itself.
Key Novelty
Grounded Chain-of-Thought (GCoT)
  • Transforms the QA process into a multi-step sequence where the model must identify and provide bounding box coordinates for key visual elements *during* the reasoning chain.
  • Introduces 'Answer-Grounding Consistency' as a primary metric, penalizing models that get the right answer but fail to ground the evidence.
  • Constructs MM-GCoT, a dataset where reasoning steps are explicitly linked to spatial coordinates via an automated graph-based generation pipeline.
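The interplay between answer correctness and grounding correctness can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the `Sample` class, the `evaluate` function, the exact-match answer check, the IoU threshold of 0.5, and in particular the definition of consistency as "fraction of correctly answered samples whose box is also correct" are all assumptions made for this example. The summary above does not give the paper's exact formula.

```python
from dataclasses import dataclass

# Hypothetical box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Sample:
    """One evaluated question (hypothetical structure, not the MM-GCoT schema)."""
    pred_answer: str
    gold_answer: str
    pred_box: Box
    gold_box: Box


def evaluate(samples: list[Sample], iou_threshold: float = 0.5) -> dict[str, float]:
    """Return answer accuracy, grounding accuracy, and an assumed consistency score."""
    n = len(samples)
    if n == 0:
        return {"answer_acc": 0.0, "grounding_acc": 0.0, "consistency": 0.0}
    # Assumption: answers are compared by case-insensitive exact match.
    answer_ok = [s.pred_answer.strip().lower() == s.gold_answer.strip().lower()
                 for s in samples]
    # Assumption: a box counts as correct when IoU with the gold box >= threshold.
    ground_ok = [iou(s.pred_box, s.gold_box) >= iou_threshold for s in samples]
    n_answer = sum(answer_ok)
    both = sum(a and g for a, g in zip(answer_ok, ground_ok))
    return {
        "answer_acc": n_answer / n,
        "grounding_acc": sum(ground_ok) / n,
        # Assumed definition: of the correct answers, how many are also grounded.
        "consistency": both / n_answer if n_answer else 0.0,
    }
```

Under this assumed definition, a model that answers 'playing frisbee' correctly but boxes a patch of grass lowers consistency without lowering answer accuracy, which is the "right answer for the wrong reason" case the metric is meant to expose.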
Architecture
Figure 2: Comparison of the GCoT process against Visual Grounding and Grounded QA.
Evaluation Highlights
  • The proposed LLaVA-7B with GCoT improves Answer-Grounding Consistency by 55.7% over the original LLaVA-7B baseline.
  • Reveals an 'inverse scaling' phenomenon where larger models perform worse on consistency: Qwen2.5-VL-7B outperforms the 72B version by 18.2% in consistency.
  • Demonstrates that current SOTA models have severe perception-reasoning gaps; e.g., LLaVA-OneVision-72B has 75.7% accuracy but only 11.1% consistency.
Breakthrough Assessment
7/10
Strong contribution in exposing the 'right answer for wrong reasons' problem with a rigorous dataset and metric. The finding that larger models have worse consistency is a significant insight.