VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought

📝 Paper Summary

Visual Question Answering (VQA)
Chart Understanding

VisDoT improves chart reasoning by training LVLMs on cognitive perceptual tasks and using a prompting strategy that separates visual element identification from logical reasoning.

Core Problem

Large vision-language models (LVLMs) struggle to align visual primitives (like position or length) with semantic concepts in charts, leading to failure in complex reasoning tasks.

Why it matters:

Models fail to reliably detect visual primitives when users don't explicitly name identifiers (e.g., legend labels).
Standard Chain-of-Thought (CoT) works for text but lacks grounding for visual attributes like color and spatial coordinates.
Instruction tuning often focuses on simple keyword-value mappings rather than the high-level perceptual alignment needed for multi-object comparison.

Concrete Example: In a chart query, a user might ask about a data trend without naming the axis labels. Current LVLMs fail to 'ground' the visual bars to their values before reasoning, resulting in hallucinated statistics.

Key Novelty

Human-Like Interpretation Grounding & Decomposition of Thought (DoT)

Formalizes four perceptual tasks (Position, Length, Pattern, Extract) based on Cleveland & McGill's graphical perception theory to align model attention with human visual processing.
Introduces Decomposition-of-Thought (DoT), a prompting strategy that explicitly splits a query into 'perception' sub-questions (finding elements) and 'logic' sub-questions (reasoning about them).
Treats VQA as a sequential process where perception questions must be answered to ground the context before logical operations are performed.

Architecture

Overview of the VisDoT framework, illustrating both the data generation pipeline (Perception-Following Question Generation) and the inference pipeline (Decomposition-of-Thought).

Evaluation Highlights

+11.2% improvement on the ChartQA benchmark with InternVL fine-tuned using VisDoT.
+33.2% improvement on the newly introduced VisDoTQA benchmark compared to baselines.
Surpasses GPT-4o on the challenging ChartQAPro benchmark.
Improves MMMU (open-domain VQA) by 2.2% absolute over the same backbone prompted with standard Chain-of-Thought.

Breakthrough Assessment

8/10

Strong improvements on specialized chart benchmarks and valid generalization to open-domain VQA. The integration of cognitive psychology (graphical perception) into the prompting/training pipeline is a significant methodological contribution.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering on Charts and Visualized Data

Inputs: Image I and complex Question Q

Outputs: Final Answer A_n, derived from a sequence of intermediate answers

Pipeline Flow

Input: Image + User Question
Decomposition Phase: Generate ordered sub-questions (Perception-oriented -> Logic-oriented)
Problem Solving Phase: Sequentially answer sub-questions using Image + Context
Output: Final answer, derived from the accumulated intermediate answers
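
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `ModelFn` interface, the `P:`/`L:` line tags, the prompt wording, and the choice to return the last sub-answer as the final answer are all assumptions made here, and `stub_model` returns canned text in place of a fine-tuned LVLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical model interface: (image reference, text prompt) -> generated text.
ModelFn = Callable[[str, str], str]


@dataclass
class DoTTrace:
    # Each entry is (kind, question), where kind is "perception" or "logic".
    sub_questions: List[Tuple[str, str]]
    answers: List[str] = field(default_factory=list)


def decompose(model: ModelFn, image: str, question: str) -> List[Tuple[str, str]]:
    """Ask for tagged sub-questions, then place perception before logic."""
    prompt = (
        "Decompose the question into sub-questions, one per line, "
        "each prefixed with 'P:' (perception) or 'L:' (logic).\n"
        f"Question: {question}"
    )
    perception, logic = [], []
    for line in model(image, prompt).splitlines():
        line = line.strip()
        if line.startswith("P:"):
            perception.append(("perception", line[2:].strip()))
        elif line.startswith("L:"):
            logic.append(("logic", line[2:].strip()))
    return perception + logic


def solve(model: ModelFn, image: str, question: str) -> Tuple[str, DoTTrace]:
    """Answer sub-questions in order, feeding earlier Q/A pairs back as context."""
    trace = DoTTrace(sub_questions=decompose(model, image, question))
    for _, sub_q in trace.sub_questions:
        context = "\n".join(
            f"Q: {q}\nA: {a}"
            for (_, q), a in zip(trace.sub_questions, trace.answers)
        )
        prompt = f"{context}\nQ: {sub_q}\nA:" if context else f"Q: {sub_q}\nA:"
        trace.answers.append(model(image, prompt).strip())
    # Assumption: the answer to the last sub-question is the final answer.
    final = trace.answers[-1] if trace.answers else ""
    return final, trace


def stub_model(image: str, prompt: str) -> str:
    """Canned responses standing in for a real LVLM (illustration only)."""
    if prompt.startswith("Decompose"):
        return (
            "L: Which bar is taller?\n"
            "P: How tall is the blue bar?\n"
            "P: How tall is the red bar?"
        )
    last_q = prompt.rstrip().splitlines()[-2]  # the line just before "A:"
    canned = {
        "Q: How tall is the blue bar?": "40",
        "Q: How tall is the red bar?": "25",
        "Q: Which bar is taller?": "blue",
    }
    return canned[last_q]


final_answer, trace = solve(stub_model, "chart.png", "Which bar is taller, blue or red?")
```

Even though the stub emits the logic sub-question first, `decompose` reorders the list so both perception sub-questions are answered before the comparison is attempted, which is the ordering constraint the pipeline describes.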

System Modules

Question Decomposer (Reasoning Engine)

Split the main question Q into a set of sub-questions, strictly prioritizing perception sub-questions (Qp) before logic sub-questions (Ql).

Model or implementation: InternVL (Fine-tuned)

Sequential Solver (Reasoning Engine)

Answer each sub-question sequentially, using the image and previous answers as context.

Model or implementation: InternVL (Fine-tuned)

Novel Architectural Elements

Strict separation of 'Perception' and 'Logic' phases within the inference generation process, formalized via the probability decomposition P({Qp, Ql} | Q).

Modeling

Base Model: InternVL

Training Method: Supervised Fine-Tuning (SFT) with Decomposition-of-Thought data

Objective Functions:

Purpose: Maximize likelihood of the decomposed reasoning path and final answer.

Formally: Decomposed VQA objective summing log-probabilities of sub-question generation and sequential answer generation.
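
One way to write such an objective, using the summary's own symbols (image I, question Q, perception sub-questions Q_p, logic sub-questions Q_l, final answer A_n). The exact form in the paper may differ; the ordered sub-question sequence q_1, ..., q_n, the intermediate answers A_i, and the parameters θ are notation assumed here for illustration.

```latex
\mathcal{L}(\theta) = -\Big[
    \log P_\theta\big(\{Q_p, Q_l\} \mid Q\big)
    + \sum_{i=1}^{n} \log P_\theta\big(A_i \mid I, q_i, A_{<i}\big)
\Big]
```

The first term scores the decomposition of Q into sub-questions; the second scores each answer given the image, the current sub-question, and all earlier answers, so the last term in the sum corresponds to the final answer A_n.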

Training Data:

VisDoTQA: A perception-following dataset constructed using automated generation based on four perceptual tasks (Position, Length, Pattern, Extract).
Questions range from OCR-style queries to explicit visual descriptor queries.
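
To make the two ends of that range concrete, here is a toy generator that produces both question styles from the same chart record. It is only a sketch under stated assumptions: the `Bar` record, the question templates, and the rule for skipping ambiguous colors are hypothetical and are not taken from the paper's generation pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Bar:
    label: str    # text identifier shown on the axis or legend
    color: str    # visual descriptor of the mark
    value: float  # ground-truth value encoded by the bar


def ocr_style_question(bar: Bar) -> str:
    """Names the element by its text label."""
    return f"What is the value of the bar labeled '{bar.label}'?"


def descriptor_style_question(bar: Bar) -> str:
    """Refers to the element only by a visual attribute, never by its label."""
    return f"What is the value of the {bar.color} bar?"


def make_pairs(bars: List[Bar]) -> List[Tuple[str, float]]:
    """Build (question, answer) pairs in both styles.

    A descriptor-style question is emitted only when the color identifies
    exactly one bar, so every generated question has a unique answer.
    """
    color_counts = {}
    for b in bars:
        color_counts[b.color] = color_counts.get(b.color, 0) + 1

    pairs = []
    for b in bars:
        pairs.append((ocr_style_question(b), b.value))
        if color_counts[b.color] == 1:
            pairs.append((descriptor_style_question(b), b.value))
    return pairs


bars = [Bar("Canada", "blue", 40.0), Bar("Mexico", "red", 25.0), Bar("Peru", "red", 10.0)]
pairs = make_pairs(bars)
```

The descriptor-style questions are the ones that require grounding: the model cannot answer them by reading a label and must locate the mark by its visual attribute first.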

Compute: Not reported in the paper

Comparison to Prior Work

vs. CoT-based methods: VisDoT explicitly enforces a 'Perception First' order, whereas CoT often attempts logic without grounding visual elements.
vs. Modular VQA (e.g., NS-VQA): VisDoT performs decomposition and reasoning within a single end-to-end LVLM rather than using separate neuro-symbolic modules.
vs. ChartGemma: VisDoT focuses on cognitive graphical perception tasks (Length, Position) rather than just general visual-element instruction tuning.

Limitations

Relies on the capability of the underlying LVLM to accurately perform the initial decomposition.
The paper does not detail failure modes in which an early decomposition or perception error cascades into later reasoning steps.
Computational cost of sequential sub-question generation is higher than direct answering (inherent to DoT/CoT methods).

Reproducibility

VisDoTQA dataset generation methodology and prompting templates (Appendix D) are described. Code URL is not provided in the text. InternVL is an open model family.

📊 Experiments & Results

Evaluation Setup

Evaluated on standard chart reasoning benchmarks and open-domain VQA benchmarks.

Benchmarks:

ChartQA (Chart Question Answering)
ChartQAPro (Complex Chart Reasoning)
VisDoTQA (Perception-following Chart QA) [New]
POPE (Object Hallucination Evaluation)
MMMU (Multi-discipline Multimodal Understanding)

Metrics:

Accuracy (assumed standard for QA tasks)
Exact Match
Statistical methodology: Not explicitly reported in the paper
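
For reference, a minimal sketch of how exact match and accuracy are commonly computed for QA outputs. The normalization used here (lowercasing and whitespace collapsing) is an assumption, not the paper's scoring script. ChartQA is commonly scored with a relaxed accuracy that tolerates small numeric error, which this sketch does not model.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.strip().lower().split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


score = accuracy([" Blue", "25", "up"], ["blue", "25", "down"])
```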

Key Results

Comparative performance on open-domain VQA benchmarks, demonstrating the generalizability of the DoT prompting strategy over standard CoT.

Benchmark	Metric	Baseline	This Paper	Δ
POPE	Performance Improvement (Absolute %)	0.0	1.43	+1.43
MMMU	Performance Improvement (Absolute %)	0.0	2.2	+2.2

Experiment Figures

Examples of perception-following questions generated by the framework.

The specific structure of the Decomposition-of-Thought (DoT) prompt.

Main Takeaways

VisDoT achieves a +11.2% improvement on ChartQA and +33.2% on VisDoTQA, confirming the efficacy of perception-grounded training.
The method surpasses GPT-4o on ChartQAPro, indicating that specialized grounding training can outperform larger, general-purpose models on complex visual reasoning.
The DoT strategy generalizes beyond charts: applying the same perception-logic decomposition to POPE and MMMU yields consistent gains over standard CoT.
Decomposition reduces hallucination by forcing the model to locate visual elements (grounding) before attempting to reason about them.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs)
Basic knowledge of Chain-of-Thought (CoT) prompting
Familiarity with chart components (legends, axes, marks)

Key Terms

LVLM: Large Vision-Language Model—a model capable of processing both images and text to generate text outputs.

DoT: Decomposition-of-Thought—a prompting strategy that breaks a complex question into sequential sub-questions, specifically separating visual perception from logical reasoning.

CoT: Chain-of-Thought—a prompting method that encourages models to generate intermediate reasoning steps before the final answer.

Grounding: The process of linking abstract linguistic concepts (e.g., 'the highest bar') to specific concrete visual regions or features in an image.

Graphical Perception: The visual decoding process humans use to interpret charts, involving tasks like estimating length, position, or angle.

Visual Primitives: Basic visual attributes such as color, shape, spatial coordinates, and length that constitute complex visualizations.

InternVL: An open Large Vision-Language Model family, used as the backbone for fine-tuning in this paper.