Explain Before You Answer: A Survey on Compositional Visual Reasoning

📝 Paper Summary

Compositional Visual Reasoning (CVR)
Multimodal Large Language Models

This survey reviews over 260 papers to formalize Compositional Visual Reasoning, advocating for modular, step-by-step inference over monolithic models to improve grounding, robustness, and interpretability in multimodal AI.

Core Problem

Monolithic vision-language models treat inputs holistically, leading to reliance on spurious dataset biases, diminishing returns in complex reasoning, and a lack of interpretable, human-like decomposition.

Why it matters:

Monolithic models frequently hallucinate by relying on linguistic priors (e.g., assuming bananas are always yellow) rather than visual evidence.
Scaling data and compute yields diminishing returns for tasks requiring multi-hop reasoning or spatial understanding.
Black-box architectures lack transparency, which makes them poorly suited to high-stakes applications such as medical imaging or autonomous driving, where reasoning steps must be verified.

Concrete Example: When asked about the color of a green banana, a monolithic model might incorrectly answer 'yellow' due to statistical linguistic priors. In contrast, a compositional model would explicitly detect the object 'banana', extract its specific visual attribute 'color', and ground the answer in the actual pixel data.
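
The banana example can be sketched as a toy program. Everything below is hypothetical and deliberately exaggerated: `LANGUAGE_PRIOR`, `detect_objects`, and the hand-written detection record stand in for a real model and detector, and none of these names come from the survey.

```python
from dataclasses import dataclass

# Hypothetical language prior: the color most often named for each object in text.
LANGUAGE_PRIOR = {"banana": "yellow", "sky": "blue"}


@dataclass
class Detection:
    label: str
    attributes: dict  # e.g. {"color": "green"}, as read from the pixels


def detect_objects(image):
    """Stand-in for a real detector; the toy 'image' is already a list of detections."""
    return image


def monolithic_answer(image, obj, attribute):
    # Caricature of the failure mode: the image is ignored and the prior decides.
    return LANGUAGE_PRIOR[obj]


def compositional_answer(image, obj, attribute):
    # Step 1: locate the object. Step 2: read the attribute from that detection.
    for det in detect_objects(image):
        if det.label == obj and attribute in det.attributes:
            return det.attributes[attribute]
    return None  # abstain when no visual evidence is found


image = [Detection("banana", {"color": "green"})]
print(monolithic_answer(image, "banana", "color"))
print(compositional_answer(image, "banana", "color"))
```

The point of the sketch is only that the compositional path has an explicit place where visual evidence enters, so a wrong answer can be traced to a specific step.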

Key Novelty

Unified Taxonomy of Compositional Visual Reasoning

Formalizes a five-stage evolutionary roadmap: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to Chain-of-Thought reasoning and Unified Agentic VLMs.
Synthesizes the benefits of compositionality into seven key dimensions, including cognitive alignment, semantic fidelity, modular reuse, and hallucination mitigation.

Architecture

Contrast between Monolithic Visual Reasoning and Compositional Visual Reasoning (CVR) paradigms.

Evaluation Highlights

Cataloged 260+ papers from top venues (CVPR, ICCV, NeurIPS, etc.) spanning January 2023 to May 2025.
Identified and reviewed 60+ benchmarks focusing on dimensions such as grounding accuracy and chain-of-thought faithfulness.

Breakthrough Assessment

9/10

A timely and comprehensive synthesis of a rapidly expanding field. It provides a structured taxonomy and defines the 'CVR' paradigm as distinct from general multimodal learning.

⚙️ Technical Details

Problem Definition

Setting: Visual Reasoning tasks mapping an image-query pair (v, q) to an answer y via intermediate steps.

Inputs: Visual input v (image) and textual query q.

Outputs: Answer y (text, selection, or grounding), derived via a sequence of n intermediate reasoning steps S = {s1, s2, ..., sn}.
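
One way to write the contrast between the two paradigms is sketched below. The function symbols f, g, and h are assumptions introduced here for illustration and may not match the survey's notation; only v, q, y, S, and the steps s_i come from the definitions above.

```latex
% Monolithic: one direct mapping from the input pair to the answer.
y = f(v, q)

% Compositional: generate the steps first, then answer conditioned on them.
s_i = g(v, q, s_1, \dots, s_{i-1}), \qquad i = 1, \dots, n
y = h(v, q, S), \qquad S = \{s_1, s_2, \dots, s_n\}
```

Under this reading, each step s_i may depend on earlier steps, which is what allows later stages to correct or refine earlier ones.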

Pipeline Flow

Input Decomposition (Decompose query q into reasoning steps S)
Visual Perception/Grounding (Execute steps using visual tools or modules)
Reasoning/Synthesis (Combine outputs to derive answer y)
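
The three stages above can be sketched as three plain functions. This is a minimal illustration under stated assumptions: the function names, the `op:argument` step format, and the dictionary used as an "image" are all hypothetical, and a real system would put an LLM behind `decompose` and vision models behind `perceive`.

```python
from typing import List, Optional

Step = str  # hypothetical step format: "<operation>:<argument>"


def decompose(query: str) -> List[Step]:
    """Stage 1: split the query into reasoning steps (an LLM or rules in practice)."""
    # Toy rule for "What <attribute> is the <object>?"
    words = query.lower().rstrip("?").split()
    obj, attribute = words[-1], words[1]
    return [f"locate:{obj}", f"attribute:{attribute}"]


def perceive(image: dict, steps: List[Step]) -> List[str]:
    """Stage 2: execute each step against the visual input."""
    outputs: List[str] = []
    region: Optional[dict] = None
    for step in steps:
        op, arg = step.split(":")
        if op == "locate":
            region = image.get(arg)  # attribute dict for the object, or None
            outputs.append("found" if region is not None else "missing")
        elif op == "attribute":
            outputs.append(region.get(arg, "unknown") if region is not None else "unknown")
    return outputs


def synthesize(outputs: List[str]) -> str:
    """Stage 3: combine step outputs into the final answer y."""
    return outputs[-1] if outputs and outputs[0] == "found" else "cannot answer"


def answer(image: dict, query: str) -> str:
    steps = decompose(query)
    return synthesize(perceive(image, steps))


print(answer({"banana": {"color": "green"}}, "What color is the banana?"))
```

Keeping the stages as separate functions means any one of them can be swapped (for example, a rule-based decomposer for an LLM) without touching the other two.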

System Modules

Decomposition Module

Breaks down the complex query into intermediate steps (e.g., object identification, attribute inference)

Model or implementation: Generic (LLM or rule-based)

Perception/Grounding Module

Extracts visual information for each step (detection, attribute recognition, depth estimation)

Model or implementation: Generic (Visual Tools or VLM Perception Head)

Reasoning Engine

Synthesizes perceptual outputs to produce the final answer

Model or implementation: Generic (Symbolic Program Execution or LLM Inference)
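
For the 'Symbolic Program Execution' option, a reasoning engine can be as small as an interpreter over a scene description. The sketch below is a CLEVR-flavored toy: the operation names (`filter`, `count`, `query`) and the scene format are assumptions made for illustration, not an interface defined by the survey.

```python
# Hypothetical scene: a list of objects with attributes, as a perception module might emit.
SCENE = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "red", "size": "small"},
    {"shape": "cube", "color": "blue", "size": "small"},
]


def op_filter(objs, key, value):
    return [o for o in objs if o.get(key) == value]


def op_count(objs):
    return len(objs)


def op_query(objs, key):
    # Only well-defined when exactly one object remains.
    if len(objs) != 1:
        raise ValueError(f"query expects 1 object, got {len(objs)}")
    return objs[0][key]


OPS = {"filter": op_filter, "count": op_count, "query": op_query}


def execute(program, scene):
    """Run a linear program: each step is (op_name, *args) applied to the running state."""
    state = scene
    trace = []
    for op_name, *args in program:
        state = OPS[op_name](state, *args)
        trace.append((op_name, args, state))  # record every intermediate result
    return state, trace


# "How many red things are there?"
count, _ = execute([("filter", "color", "red"), ("count",)], SCENE)
# "What color is the small cube?"
color, trace = execute(
    [("filter", "shape", "cube"), ("filter", "size", "small"), ("query", "color")], SCENE
)
print(count, color)
```

The returned trace holds every intermediate result, so each step of the reasoning can be inspected independently of the final answer.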

Novel Architectural Elements

Formalization of the transition from static 'Prompt-Enhanced' pipelines to dynamic 'Unified Agentic VLMs'
Explicit representation of intermediate reasoning steps S prior to answer generation, in contrast to a direct (v, q) -> y mapping
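
The difference between a static pipeline and a dynamic, agentic one can be illustrated with a short sketch. The tool names, the `answer:` action prefix, and the hand-written `policy` function are hypothetical stand-ins; in an actual agentic VLM the policy would be the model itself.

```python
def static_pipeline(tools, plan, image):
    """Fixed plan: every step runs in order, regardless of what earlier steps returned."""
    return [tools[name](image) for name in plan]


def agentic_loop(tools, policy, image, max_steps=5):
    """Feedback-driven: the policy sees all observations so far and picks the next action."""
    observations = []
    for _ in range(max_steps):
        action = policy(observations)
        if action.startswith("answer:"):
            return action[len("answer:"):], observations
        observations.append((action, tools[action](image)))
    return None, observations  # step budget exhausted without an answer


# Hypothetical tools operating on a toy dictionary "image".
tools = {
    "detect": lambda img: list(img),            # object labels present
    "zoom": lambda img: img.get("banana", {}),  # attributes of one region
}


def policy(observations):
    """Hand-written stand-in for a VLM controller."""
    if not observations:
        return "detect"
    last_action, last_result = observations[-1]
    if last_action == "detect":
        # Zoom only if the detector actually found the object of interest.
        return "zoom" if "banana" in last_result else "answer:no banana found"
    return f"answer:{last_result.get('color', 'unknown')}"


ans, obs = agentic_loop(tools, policy, {"banana": {"color": "green"}})
print(ans, [name for name, _ in obs])
```

In the loop, the second tool call happens only because the first one reported a banana; the static pipeline would call both tools either way.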

Modeling

Base Model: Review covers multiple architectures (LLMs, VLMs, Agentic frameworks)

Training Method: Survey reviews various methods including Prompting, Tool-Use, and Chain-of-Thought Fine-tuning

Adaptation: Varies by reviewed paper (from frozen API calls to full parameter tuning)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Monolithic Models: CVR explicitly decomposes tasks into steps to improve grounding and reduce bias.
vs. Early Neurosymbolic: Modern CVR utilizes LLMs/VLMs for flexible planning and generalization rather than rigid symbolic parsers.

Limitations

Survey scope limited to 2D image-based modalities (excludes video/3D).
LLM-based reasoning can still suffer from hallucinations despite compositional structure.
Scalable supervision for intermediate reasoning steps remains a challenge.
Current benchmarks may not fully probe high-resolution perception or complex chain-of-thought faithfulness.

Reproducibility

As a survey, this paper does not introduce a new model to reproduce. It catalogs existing works (260+ papers) and benchmarks.

📊 Experiments & Results

Evaluation Setup

Systematic review of existing benchmarks rather than new experimental results.

Benchmarks:

CLEVR (Synthetic Visual Reasoning)
GQA (Real-world Visual Reasoning / VQA)
VCR (Visual Commonsense Reasoning)

Metrics:

Grounding accuracy
Chain-of-thought faithfulness
High-resolution perception

Statistical methodology: Not explicitly reported in the paper
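
This summary does not give formulas for the metrics listed above. As one common convention (an assumption here, not a definition taken from the survey), grounding accuracy on box-level tasks is the fraction of predicted boxes whose intersection-over-union (IoU) with the ground-truth box reaches a threshold, often 0.5. A minimal sketch:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds, targets, threshold=0.5):
    """Share of predicted boxes whose IoU with the paired target meets the threshold."""
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


# Toy data: the first prediction is exact, the second overlaps its target only partly.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (5, 5, 15, 15)]
print(grounding_accuracy(preds, targets))
```

Individual benchmarks may use different thresholds, or point- or mask-based variants.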

Main Takeaways

Monolithic models show diminishing returns: scaling data and compute does not proportionally improve multi-hop reasoning or spatial understanding.
The survey argues that compositional approaches are needed to keep models from exploiting statistical dataset biases (e.g., linguistic priors).
The field is shifting from static template-based prompting toward dynamic, feedback-driven 'Agentic' architectures.
Explicit grounding of intermediate steps is critical for reducing hallucinations and improving factual consistency.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) like CLIP and LLaVA
Familiarity with visual tasks: VQA, Visual Grounding, Scene Graphs
Basic knowledge of Large Language Models (LLMs) and prompting

Key Terms

CVR: Compositional Visual Reasoning—a paradigm that decomposes visual tasks into structured steps (objects, attributes, relations) rather than mapping inputs directly to answers.

Monolithic Visual Reasoning: End-to-end architectures (like CLIP or standard LLaVA) that encode vision and language jointly to predict answers without explicit intermediate reasoning steps.

Grounding: The process of linking abstract concepts (e.g., 'the red ball') to specific regions or pixels in the visual input.

Chain-of-Thought: A reasoning technique where the model generates a sequence of intermediate logical steps before producing the final answer.

LLM: Large Language Model—AI models trained on vast text data to understand and generate human language.

VLM: Vision-Language Model—AI models that process and relate both image and text inputs.

Systematic Generalization: The ability to understand and reason about novel combinations of known concepts (e.g., recognizing a 'purple giraffe' after seeing 'purple' and 'giraffe' separately).

Hallucination: When a model generates plausible but factually incorrect information, often driven by training data biases rather than the actual input.
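
Systematic generalization, as defined above, is usually tested with a split in which every individual concept appears in training while some combinations are held out. A minimal sketch of such a split follows; the concept lists and the held-out pair are illustrative assumptions, not data from the survey.

```python
from itertools import product

colors = ["purple", "yellow", "gray"]
animals = ["giraffe", "elephant", "parrot"]

held_out = {("purple", "giraffe")}  # novel combination reserved for testing

all_pairs = set(product(colors, animals))
train_pairs = all_pairs - held_out
test_pairs = held_out


def concepts(pairs):
    """Set of individual concepts (colors and animals) that occur in the given pairs."""
    return {c for pair in pairs for c in pair}


# The split is valid only if every test concept was seen in training,
# while no test combination was.
assert concepts(test_pairs) <= concepts(train_pairs)
assert not (test_pairs & train_pairs)
print(len(train_pairs), len(test_pairs))
```

A model that scores well on `train_pairs` but fails on `test_pairs` has memorized combinations rather than learned the parts.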