Compositional Chain-of-Thought Prompting for Large Multimodal Models

📝 Paper Summary

Multimodal Reasoning, Visual Question Answering (VQA), Prompt Engineering

CCoT improves multimodal reasoning by prompting Large Multimodal Models to generate their own scene graphs as an intermediate reasoning step, without needing fine-tuning or external annotations.

Core Problem

State-of-the-art Large Multimodal Models (LMMs) often view images as a simple "bag of objects," failing to understand compositional relationships (attributes and spatial relations between objects).

Why it matters:

Current models struggle with questions requiring precise spatial or attribute understanding (e.g., distinguishing a person *on* a horse vs. *beside* it)
Existing Scene Graph (SG) approaches typically require expensive ground-truth annotations, which do not scale
Fine-tuning models on Scene Graph data can lead to catastrophic forgetting of the original pre-training objectives

Concrete Example: Given an image of a desk, a standard LMM might list "laptop, mouse, books" but fail to describe *how* the objects are arranged (e.g., "a stack of books on a laptop"). CCoT first generates a structured graph that captures these relations, then answers.

Key Novelty

Compositional Chain-of-Thought (CCoT)

Zero-shot prompting strategy that forces the LMM to first generate a Scene Graph (SG) in JSON format containing objects, attributes, and relationships based on the image and task
Uses this self-generated SG as an explicit context in a second prompt to answer the user's question, effectively creating a reasoning bridge between the visual input and the final textual response

Architecture

The CCoT inference pipeline: (1) Image + SG Generation Prompt creates a JSON Scene Graph. (2) Image + Original Prompt + Generated Scene Graph creates the Final Response.
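The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `LMM` is a hypothetical interface (any function taking an image reference and a prompt and returning text), `fake_lmm` is a stand-in used only to show the call pattern, and the prompt wording is illustrative rather than the paper's exact templates.

```python
from typing import Callable

# Hypothetical interface: takes an image reference and a text prompt,
# returns the model's text output. Swap in a real model call here.
LMM = Callable[[str, str], str]

# Illustrative wording only; the exact templates are given in the paper.
SG_PROMPT = (
    "For the provided image and its associated question, generate a scene "
    "graph in JSON format that includes: (1) objects relevant to the "
    "question, (2) their attributes, (3) relationships between them.\n"
    "Question: {question}"
)
ANSWER_PROMPT = (
    "Scene graph:\n{scene_graph}\n\n"
    "Use the image and the scene graph as context to answer the question.\n"
    "Question: {question}"
)


def ccot_answer(lmm: LMM, image: str, question: str) -> tuple[str, str]:
    """Run the two-step pipeline and return (scene_graph, response)."""
    # Step 1: image + SG generation prompt -> scene graph text.
    scene_graph = lmm(image, SG_PROMPT.format(question=question))
    # Step 2: image + original prompt + generated scene graph -> response.
    response = lmm(
        image,
        ANSWER_PROMPT.format(scene_graph=scene_graph, question=question),
    )
    return scene_graph, response


def fake_lmm(image: str, prompt: str) -> str:
    """Stand-in for a real model, used only to demonstrate the flow."""
    if prompt.startswith("For the provided image"):
        return '{"objects": ["person", "horse"], "relationships": [["person", "on", "horse"]]}'
    return "The person is riding the horse."


scene_graph, response = ccot_answer(fake_lmm, "horse.jpg", "Where is the person?")
```

The same model is called twice with the same image; only the prompt changes, which is why no fine-tuning or extra annotation is needed.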

Evaluation Highlights

Significant improvement on Winoground (a compositional benchmark) using GPT-4V, outperforming the previous state-of-the-art (SGVL) which required fine-tuning on annotated scene graphs
Consistently outperforms baseline zero-shot prompting and standard Chain-of-Thought (CoT) across LLaVA-1.5, InstructBLIP, SPHINX, and GPT-4V
Improves performance on general multimodal benchmarks like SEEDBench and MMBench, not just compositional tasks

Breakthrough Assessment

8/10

Simple yet effective prompting strategy that addresses a known LMM weakness (compositionality) without training or data costs. The results suggest that structured intermediate representations can work better than free-text reasoning for vision tasks.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot visual question answering and multimodal reasoning

Inputs: An image I and a task prompt P_in (e.g., a question)

Outputs: A textual response R answering the prompt based on the image

Pipeline Flow

Step 1: Scene Graph Generation Prompt
Step 2: Response Generation Prompt

System Modules

Scene Graph Generator

Generate a structured JSON scene graph describing objects, attributes, and relationships

Model or implementation: The target LMM itself (e.g., LLaVA-1.5, GPT-4V)
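Because the graph comes from a general-purpose LMM, the raw output may wrap the JSON in extra text or be malformed. The method itself passes the generated graph back as context; parsing it is optional but useful for inspection. The sketch below is one way to do that, under assumptions: the helper name `extract_scene_graph` and the key names in the example are hypothetical, not taken from the paper.

```python
import json


def extract_scene_graph(raw: str) -> dict:
    """Pull the first JSON object out of raw model text; return {} on failure."""
    start = raw.find("{")
    if start == -1:
        return {}
    try:
        # raw_decode parses one JSON value and ignores whatever follows it.
        graph, _ = json.JSONDecoder().raw_decode(raw[start:])
    except json.JSONDecodeError:
        return {}
    return graph if isinstance(graph, dict) else {}


# Hypothetical model output with chatter around the JSON payload.
raw_output = (
    "Here is the scene graph: "
    '{"objects": [{"name": "books", "attributes": ["stacked"]}, {"name": "laptop"}], '
    '"relationships": [{"subject": "books", "predicate": "on", "object": "laptop"}]} '
    "Let me know if you need more detail."
)
graph = extract_scene_graph(raw_output)
```

Returning an empty dict on failure lets a caller fall back to the plain zero-shot prompt instead of crashing on a bad generation.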

Response Generator

Produce the final answer using the original inputs plus the generated scene graph

Model or implementation: The same target LMM (frozen)

Novel Architectural Elements

Two-stage inference pipeline that uses a self-generated Scene Graph (JSON) as the intermediate Chain-of-Thought reasoning step instead of a natural-language narrative

Modeling

Base Model: Evaluated on InstructBLIP-13B, LLaVA-1.5-13B, SPHINX, and GPT-4V

Compute: Inference-only; requires infrastructure sufficient to run the respective base models (e.g., hardware that can serve a 13B-parameter model, or API access for GPT-4V)

Comparison to Prior Work

vs. VidIL/DDCoT: Uses structured Scene Graphs (objects/relations/attributes) instead of unstructured captions, enabling better compositionality
vs. Multimodal-CoT: CCoT is zero-shot (inference only), whereas Multimodal-CoT requires expensive fine-tuning on reasoning data
vs. SGVL: Does not require ground-truth Scene Graph annotations or fine-tuning (avoids catastrophic forgetting)
vs. Standard Zero-Shot CoT ('Let's think step by step'): Incorporates explicit visual structure (SG) rather than just language-based reasoning

Limitations

Reliance on the LMM's ability to generate accurate scene graphs; if the model misses an object in the first step, the graph will be incomplete and the error can carry over into the final answer
Increased token usage and latency due to the two-step prompting process and verbose JSON generation
Performance gains vary across different LMM architectures

Reproducibility

Code: https://github.com/chancharikmitra/CCoT

The method relies on prompt templates provided in the paper. Pre-trained weights for LLaVA, InstructBLIP, and SPHINX are available via their original repositories; GPT-4V is accessed via API.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation across multiple vision-language benchmarks

Benchmarks:

Winoground: Compositional Visual Understanding (matching captions with swapped objects/relations)
WHOOPS!: Compositional VQA (images violating visual commonsense)
SEEDBench (Image split): General Multimodal Reasoning
MMBench: General Multimodal Reasoning
LLaVA-Bench In-the-Wild: Detailed Visual Description/QA

Metrics:

Text Score, Image Score, and Group Score (Winoground)
Accuracy (all other benchmarks)
Statistical methodology: Not explicitly reported in the paper
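For readers unfamiliar with the three Winoground metrics: each example pairs two images with two captions, and the scores check whether a model matches them correctly in each direction. The sketch below follows the standard benchmark definitions, assuming a scalar match score for every caption-image pair. How an LMM produces that score is not specified in this summary, and the `Scores` container and function names are hypothetical.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    """Match scores for one example; c0 belongs with i0, c1 with i1."""
    c0_i0: float
    c0_i1: float
    c1_i0: float
    c1_i1: float


def text_correct(s: Scores) -> bool:
    # For each image, the correct caption must score higher than the wrong one.
    return s.c0_i0 > s.c1_i0 and s.c1_i1 > s.c0_i1


def image_correct(s: Scores) -> bool:
    # For each caption, the correct image must score higher than the wrong one.
    return s.c0_i0 > s.c0_i1 and s.c1_i1 > s.c1_i0


def group_correct(s: Scores) -> bool:
    # An example counts for the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


def winoground_metrics(examples: list[Scores]) -> dict[str, float]:
    """Return text, image, and group scores as percentages."""
    n = len(examples)
    return {
        "text": 100 * sum(text_correct(s) for s in examples) / n,
        "image": 100 * sum(image_correct(s) for s in examples) / n,
        "group": 100 * sum(group_correct(s) for s in examples) / n,
    }


# Made-up scores: the first example passes both checks, the second only text.
demo = [Scores(0.9, 0.2, 0.1, 0.8), Scores(0.6, 0.7, 0.5, 0.9)]
metrics = winoground_metrics(demo)
```

Because the group score requires both directions to be right on the same example, it can never exceed the text or image score.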

Key Results

CCoT consistently improves performance on the Winoground compositional benchmark compared to base models and standard Zero-Shot CoT.

Benchmark	Metric	Baseline	This Paper	Δ
Winoground	Text Score	29.25	34.00	+4.75
Winoground	Group Score	11.75	14.50	+2.75
Winoground	Text Score	60.75	64.50	+3.75

CCoT also shows improvements on general multimodal benchmarks like SEEDBench and MMBench.

Benchmark	Metric	Baseline	This Paper	Δ
SEEDBench-Image	Accuracy	68.21	68.49	+0.28
MMBench	Accuracy	66.56	67.58	+1.02
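The Δ column is an absolute difference in points. Relative gain gives a second view of the same numbers, since a few points on a low baseline can be a larger proportional change than on a high one. The snippet below recomputes both from the values reported above; the two Winoground Text Score rows stay separate because the table does not say which base model each comes from.

```python
# (benchmark, metric, baseline, with CCoT), copied from the results above.
ROWS = [
    ("Winoground", "Text Score", 29.25, 34.00),
    ("Winoground", "Group Score", 11.75, 14.50),
    ("Winoground", "Text Score", 60.75, 64.50),
    ("SEEDBench-Image", "Accuracy", 68.21, 68.49),
    ("MMBench", "Accuracy", 66.56, 67.58),
]


def gains(rows):
    """Return (absolute delta in points, relative gain in percent) per row."""
    result = []
    for _benchmark, _metric, baseline, ccot in rows:
        delta = ccot - baseline
        result.append((round(delta, 2), round(100 * delta / baseline, 1)))
    return result


for (benchmark, metric, _, _), (delta, rel) in zip(ROWS, gains(ROWS)):
    print(f"{benchmark:16s} {metric:12s} {delta:+.2f} pts ({rel:+.1f}% relative)")
```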

Experiment Figures

Qualitative comparison of Base model vs. CCoT outputs on difficult visual questions.

Main Takeaways

CCoT improves compositional reasoning (e.g., on Winoground), where standard CoT ('Let's think step by step') often fails to help or degrades performance.
The method improves results across different LMM architectures (LLaVA-1.5, InstructBLIP, SPHINX, GPT-4V) without any model-specific tuning, though the size of the gain varies by model.
Generating Scene Graphs helps mitigate the 'bag of objects' failure mode by forcing the model to explicitly articulate relationships and attributes.
Improvements are observed on both specialized compositional tasks (WHOOPS!, Winoground) and general purpose benchmarks (SEEDBench, MMBench).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Multimodal Models (LMMs) and how they process image-text pairs
Familiarity with Chain-of-Thought (CoT) prompting
Basic concept of Scene Graphs (nodes = objects, edges = relationships)

Key Terms

Scene Graph (SG): A structured representation of an image where nodes are objects, edges represent relationships between them (e.g., 'cup on table'), and attributes (e.g., 'red cup') are attached to the objects they describe

LMM: Large Multimodal Model—an AI model capable of processing and reasoning over both text and images (e.g., GPT-4V, LLaVA)

Compositionality: The ability to understand a complex scene by understanding its parts (objects) and how they combine (relationships/attributes), rather than just listing isolated elements
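A small example makes the difference between a "bag of objects" and a compositional view concrete. It encodes two scenes as (subject, predicate, object) triples; the triple encoding and the helper name `bag_of_objects` are illustrative assumptions, not the paper's format.

```python
def bag_of_objects(triples):
    """Discard the relationships and keep only the set of object names."""
    objects = set()
    for subject, _predicate, obj in triples:
        objects.update((subject, obj))
    return objects


# Two different scenes built from the same objects.
riding = {("person", "on", "horse")}
standing = {("person", "beside", "horse")}

same_objects = bag_of_objects(riding) == bag_of_objects(standing)  # True
same_scene = riding == standing  # False
```

A model that only tracks which objects are present cannot tell the two scenes apart, while the relationship triples differ.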

Chain-of-Thought (CoT): A prompting technique where the model is asked to generate intermediate reasoning steps before the final answer

Zero-shot: Performing a task without seeing any specific training examples for that task beforehand

Catastrophic forgetting: A phenomenon where a model loses previously learned capabilities when trained on new data (e.g., fine-tuning on scene graphs can degrade its general knowledge)

JSON: JavaScript Object Notation, a structured text format used here so that the model emits the scene graph in a consistent, machine-readable layout