Grounded Chain-of-Thought for Multimodal Large Language Models

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Visual Hallucination, Chain-of-Thought Reasoning

Grounded Chain-of-Thought (GCoT) mitigates visual hallucination in MLLMs by requiring models to explicitly ground reasoning steps with bounding boxes before predicting an answer.

Core Problem

MLLMs often answer correctly due to language bias (hallucination) rather than visual understanding, and existing benchmarks fail to penalize correct answers derived from irrelevant visual regions.

Why it matters:

Current evaluation metrics (like POPE) can be gamed by models that guess correctly without relying on the image, masking reliability issues.
Visual hallucination and language bias persist even in state-of-the-art models like GPT-4V, creating risks in safety-critical applications.
Scaling up model size does not automatically solve grounding consistency; larger models often hallucinate more by relying on stronger language priors.

Concrete Example: An MLLM might correctly answer 'What is the person doing?' as 'playing frisbee' based on context, but when asked to point to the frisbee, it highlights a random patch of grass, indicating that the answer did not come from perceiving the object.

Key Novelty

Grounded Chain-of-Thought (GCoT)

Transforms the QA process into a multi-step sequence where the model must identify and provide bounding box coordinates for key visual elements *during* the reasoning chain.
Introduces 'Answer-Grounding Consistency' as a primary metric, penalizing models that get the right answer but fail to ground the evidence.
Constructs MM-GCoT, a dataset where reasoning steps are explicitly linked to spatial coordinates via an automated graph-based generation pipeline.

Architecture

Comparison of GCoT process against Visual Grounding and Grounded QA.

Evaluation Highlights

Proposed LLaVA-7B GCoT achieves a +55.7% improvement in Answer-Grounding Consistency compared to the original LLaVA-7B baseline.
Reveals an 'inverse scaling' phenomenon where larger models perform worse on consistency: Qwen2.5-VL-7B outperforms the 72B version by 18.2% in consistency.
Demonstrates that current SOTA models have severe perception-reasoning gaps; e.g., LLaVA-OneVision-72B has 75.7% accuracy but only 11.1% consistency.

Breakthrough Assessment

7/10

Strong contribution in exposing the 'right answer for wrong reasons' problem with a rigorous dataset and metric. The finding that larger models have worse consistency is a significant insight.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Question Answering with intermediate grounded reasoning steps.

Inputs: Image I and Question T

Outputs: A sequence of reasoning steps (G_t, R_t) including bounding boxes, followed by Final Answer A.

Pipeline Flow

Input (Image + Question)
GCoT Inference (Reasoning Steps + Grounding)
Final Answer Generation
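
The input/output contract above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' code: the class names, the (x1, y1, x2, y2) box convention, and the "Step i: ... [box]" text layout are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x1, y1, x2, y2); this coordinate convention is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedStep:
    reasoning: str  # R_t: text of the reasoning step
    box: Box        # G_t: bounding box grounding that step


@dataclass
class GCoTOutput:
    steps: List[GroundedStep]
    answer: str     # A: final answer, produced after all steps


def render(output: GCoTOutput) -> str:
    """Flatten steps and answer into one text sequence (hypothetical layout)."""
    lines = []
    for i, step in enumerate(output.steps, start=1):
        coords = ", ".join(f"{v:.2f}" for v in step.box)
        lines.append(f"Step {i}: {step.reasoning} [{coords}]")
    # The answer always comes last, after every grounded step.
    lines.append(f"Answer: {output.answer}")
    return "\n".join(lines)
```

Keeping the answer as a separate field makes the ordering constraint explicit: the grounded steps are emitted first and the answer is appended only afterwards.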

System Modules

MLLM Backbone

Processes image features and text instructions to generate grounded reasoning steps and answers

Model or implementation: Evaluated on various backbones (LLaVA, Qwen-VL, InternVL)

Novel Architectural Elements

Integration of bounding box prediction directly into the chain-of-thought token sequence (Grounded CoT) for general QA, distinct from specialized grounding heads.
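
Because the boxes are emitted as ordinary tokens, they can be recovered from the generated text with plain string processing. The sketch below assumes boxes are written as four comma-separated non-negative numbers in square brackets; that serialization and the function name `extract_boxes` are illustrative assumptions, and the actual token format depends on the backbone.

```python
import re
from typing import List, Tuple

# One non-negative number, with an optional decimal part.
_NUM = r"(\d+(?:\.\d+)?)"
# Four such numbers separated by commas inside square brackets (assumed format).
_BOX_PATTERN = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]")


def extract_boxes(text: str) -> List[Tuple[float, ...]]:
    """Return every [x1, y1, x2, y2] box found in the generated text, in order."""
    return [
        tuple(float(v) for v in match.groups())
        for match in _BOX_PATTERN.finditer(text)
    ]
```

Bracketed groups with fewer than four numbers are ignored, so ordinary text such as a three-item list is not mistaken for a box.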

Modeling

Base Model: LLaVA-1.5 (7B/13B) for the main GCoT training experiments

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize difference between predicted tokens (text + coordinates) and ground truth.

Formally: Standard auto-regressive language modeling loss.
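
For reference, a standard auto-regressive loss over the target sequence can be written as follows. The notation is an assumption: y_1, ..., y_N denotes the target tokens (the text and coordinate tokens of the grounded steps followed by the answer A), θ the model parameters, and I and T the image and question from the problem definition. Whether the sum covers only response tokens is also an assumption, since the summary does not specify it.

```latex
\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_\theta\left(y_i \mid y_{<i},\, I,\, T\right)
```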

Training Data:

MM-GCoT Dataset: 23,028 training samples, 994 test samples.
Constructed from Visual Genome by matching regions, building spatial graphs, filling templates, and rewriting via LLM.

Key Hyperparameters:

Exact hyperparameters (learning rate, batch size) are not listed in the text; training is stated to follow the 'default settings of LLaVA'.

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaVA-CoT: GCoT requires explicit spatial coordinates for evidence, not just text explanations.
vs. Grounded QA: GCoT enforces multi-step decomposition and spatial reasoning before the answer, rather than just grounding the final entity.
vs. VoCoT: MM-GCoT focuses specifically on alleviating visual hallucination (language bias) and introduces consistency metrics.

Limitations

Smaller models (e.g., Qwen2.5-VL-3B) struggle to follow GCoT instructions strictly.
Consistency metrics are strict; valid reasoning with slightly imprecise boxes might be penalized (Acc@0.5).
Evaluation is limited to the custom MM-GCoT test set for consistency metrics.
The approach relies on SFT data derived from Visual Genome, which may have its own biases.

Reproducibility

Code: https://github.com/DoubtedSteam/MM-GCoT

The dataset and evaluation scripts are released at the URL above. Training follows standard LLaVA SFT protocols.

📊 Experiments & Results

Evaluation Setup

Multimodal QA with grounding verification on the MM-GCoT benchmark.

Benchmarks:

MM-GCoT Test Set: Grounded QA covering Attribute, Judgment, and Object tasks [New]

Metrics:

Answer Accuracy (A-Acc)
Grounding Accuracy (G-Acc, IoU>0.5)
Answer-Grounding Consistency (Consist.)
Statistical methodology: Not explicitly reported in the paper
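
A minimal sketch of how the three metrics relate, assuming each test sample has already been judged for answer correctness and grounding correctness (IoU > 0.5). Consistency is computed here as "both correct", following the definition given in the Key Terms of this summary; the function name and input layout are hypothetical, and the paper's exact formula may differ.

```python
from typing import Dict, List, Tuple


def gcot_metrics(results: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute A-Acc, G-Acc and Consist. as percentages.

    results holds one (answer_correct, grounding_correct) pair per sample.
    """
    n = len(results)
    if n == 0:
        raise ValueError("results must contain at least one sample")
    answer_hits = sum(1 for ans, _ in results if ans)
    grounding_hits = sum(1 for _, grd in results if grd)
    # A sample counts toward consistency only if both parts are correct.
    both_hits = sum(1 for ans, grd in results if ans and grd)
    return {
        "A-Acc": 100.0 * answer_hits / n,
        "G-Acc": 100.0 * grounding_hits / n,
        "Consist.": 100.0 * both_hits / n,
    }
```

Under this definition Consist. can never exceed the smaller of A-Acc and G-Acc, which is why a model with high answer accuracy can still score very low on consistency.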

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Analysis of state-of-the-art MLLMs reveals a severe lack of consistency between answers and visual evidence (first row: LLaVA-OneVision-72B).
MM-GCoT	Answer-Grounding Consistency	100.0	11.1	-88.9
MM-GCoT	Answer-Grounding Consistency	100.0	35.4	-64.6
Impact of model scaling on consistency shows an inverse relationship: larger models often perform worse (Δ: Qwen2.5-VL-7B over Qwen2.5-VL-72B).
MM-GCoT	Answer-Grounding Consistency	Not reported in the paper	Not reported in the paper	+18.2
Training with GCoT data significantly improves consistency (first row: LLaVA-7B GCoT over the original LLaVA-7B).
MM-GCoT	Answer-Grounding Consistency	Not reported in the paper	Not reported in the paper	+55.7
MM-GCoT	Answer-Grounding Consistency	Not reported in the paper	Not reported in the paper	+61.2

Experiment Figures

Visualization of Qwen2.5-VL and InternVL2.5 outputs under answer-first and grounding-first prompts.

Qualitative comparison between LLaVA-1.5 13B GCoT and InternVL2.5-78B.

Main Takeaways

Existing MLLMs exhibit severe visual hallucination: high answer accuracy often masks poor visual grounding (e.g., LLaVA-OneVision-72B has 75.7% Acc but only 11.1% Consistency).
Visual hallucination is not solved by scale; smaller models often have better answer-grounding consistency than their larger counterparts (e.g., Qwen2.5-VL-7B > Qwen2.5-VL-72B).
GCoT training effectively aligns reasoning with perception, boosting consistency metrics significantly (up to +61.2%) without needing architectural changes.
Models perform best on the 'Object' task subset but struggle with 'Attribute' and 'Judgment' tasks which require more fine-grained spatial reasoning.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Visual Grounding / Referring Expression Comprehension
Chain-of-Thought (CoT) Prompting
Supervised Fine-Tuning (SFT)

Key Terms

GCoT: Grounded Chain-of-Thought—a reasoning process where MLLMs output bounding box coordinates for relevant objects alongside text steps before answering.

Answer-Grounding Consistency: A metric measuring the percentage of samples where the model predicts *both* the correct text answer and the correct bounding box evidence.

Visual Hallucination: In this context, specifically refers to MLLMs generating correct answers based on language priors (bias) rather than actual visual perception.

IoU: Intersection over Union—a standard metric to evaluate the overlap between a predicted bounding box and the ground truth box.

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific labeled dataset to adapt its behavior.

Visual Genome: A large-scale dataset providing detailed scene graphs, object bounding boxes, and attribute annotations used here to construct MM-GCoT.

Acc@0.5: Accuracy where a prediction is considered correct only if the Intersection over Union (IoU) with the ground truth is greater than 0.5.
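
The IoU and Acc@0.5 definitions above can be illustrated with a short sketch. It assumes axis-aligned boxes in (x1, y1, x2, y2) form with x1 <= x2 and y1 <= y2; the function names are illustrative and not taken from the released code.

```python
from typing import Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_grounded(pred: Sequence[float], gt: Sequence[float], threshold: float = 0.5) -> bool:
    """Acc@0.5-style check: correct only if IoU is strictly greater than the threshold."""
    return iou(pred, gt) > threshold
```

Note the strict inequality: a prediction whose IoU is exactly 0.5 is counted as incorrect under the "greater than 0.5" wording used here.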