Multimodal Chain-of-Thought Reasoning in Language Models

📝 Paper Summary

Multimodal Reasoning, Chain-of-Thought (CoT) Prompting

Multimodal-CoT separates rationale generation and answer inference into a two-stage framework that incorporates dense vision features, enabling small models (<1B parameters) to reason effectively with far fewer hallucinated rationales.

Core Problem

Small language models (<1B parameters) fail to perform effective Chain-of-Thought reasoning, often hallucinating rationales that degrade answer accuracy compared to direct prediction.

Why it matters:

Existing CoT methods focus on large language models (>100B params), which are resource-intensive to deploy.
Converting images to captions for LLMs results in significant information loss.
Naive fine-tuning of small models for CoT causes 'hallucinated rationales' where the reasoning contradicts the image or leads to wrong answers (e.g., accuracy drops from 81.63% to 69.32%).

Concrete Example: In a ScienceQA physics problem about magnets, a text-only model hallucinates the rationale 'The south pole of one magnet is closest to the south pole of the other' because it cannot see the image, leading to an incorrect answer. The proposed method uses vision features to correctly identify the poles.

Key Novelty

Two-Stage Multimodal-CoT Framework

Separates the reasoning process into two distinct stages: (1) generating a rationale based on image and text, and (2) inferring the answer based on the image, text, and generated rationale.
Injects dense vision features (via ViT) directly into the language model's encoder rather than relying on lossy image captions.

Architecture

The two-stage framework for Multimodal-CoT. Stage 1 generates the rationale from language and vision inputs. Stage 2 appends the generated rationale to the language input to infer the final answer.

Evaluation Highlights

Achieves 85.31% accuracy on ScienceQA with a model under 1B parameters, surpassing the previous text-only baseline of 81.63%.
Corrects 60.7% of hallucination errors observed in text-only baselines by incorporating vision features.
Outperforms caption-based multimodal approaches (79.37%) by using deep fusion of vision features.

Breakthrough Assessment

8/10

Significant because it demonstrates that small models (<1B params) can perform effective CoT reasoning if the architecture (two-stage) and modality fusion (vision features) are handled correctly, challenging the assumption that CoT requires >100B parameters.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Question Answering with intermediate rationale generation

Inputs: Language input X_language (Question, Context, Options) and Vision input X_vision (Image)

Outputs: Target text Y (Rationale R in stage 1, Answer A in stage 2)
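
To make the input/output definition concrete, the following is a minimal sketch of how X_language could be assembled from the question, context, and options. The `Example` class, the `build_language_input` helper, and the field labels ("Question:", "Context:", "Options:", "Rationale:") are illustrative assumptions, not the paper's exact prompt format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One multimodal multiple-choice item (field names are hypothetical)."""
    question: str
    context: str
    options: List[str]
    image_path: Optional[str] = None  # source of X_vision; None if no image


def build_language_input(ex: Example, rationale: Optional[str] = None) -> str:
    """Assemble X_language; for stage 2 the generated rationale is appended."""
    letters = "ABCDEFGH"
    opts = " ".join(f"({letters[i]}) {o}" for i, o in enumerate(ex.options))
    text = f"Question: {ex.question}\nContext: {ex.context}\nOptions: {opts}"
    if rationale is not None:
        # Assumed label; the summary only says the rationale is appended.
        text += f"\nRationale: {rationale}"
    return text
```

Calling `build_language_input(ex)` gives the stage 1 input, and passing the generated rationale as the second argument gives the stage 2 input.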

Pipeline Flow

Stage 1: Rationale Generation (Image + Question -> Rationale)
Stage 2: Answer Inference (Image + Question + Rationale -> Answer)
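
The two stages above can be sketched as plain control flow. This is an illustration under stated assumptions: `two_stage_inference`, `toy_rationale_model`, and `toy_answer_model` are hypothetical names, and the toy models are hard-coded stand-ins for the two fine-tuned models.

```python
from typing import Callable, Sequence, Tuple

# A "model" here is any callable mapping (text, vision_features) -> text.
Model = Callable[[str, Sequence[float]], str]


def two_stage_inference(
    language_input: str,
    vision_features: Sequence[float],
    rationale_model: Model,
    answer_model: Model,
) -> Tuple[str, str]:
    """Run stage 1 (rationale), then stage 2 (answer) on the augmented input."""
    rationale = rationale_model(language_input, vision_features)
    # Stage 2 sees the original text plus the generated rationale,
    # and receives the same vision features again.
    augmented = f"{language_input}\n{rationale}"
    answer = answer_model(augmented, vision_features)
    return rationale, answer


# Toy stand-ins so the sketch runs without trained weights.
def toy_rationale_model(text: str, vision: Sequence[float]) -> str:
    return "Solution: the facing poles are opposite, so the magnets attract."


def toy_answer_model(text: str, vision: Sequence[float]) -> str:
    if "so the magnets attract" in text:
        return "The answer is (A)."
    return "The answer is (B)."
```

The point of the separation is that the answer model is conditioned on a finished rationale rather than generating rationale and answer in a single output sequence.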

System Modules

Vision Extractor (Input Processing)

Extract patch-level features from input images

Model or implementation: ViT (Vision Transformer)
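
As a rough illustration of what "patch-level" means, the sketch below splits a single-channel image into non-overlapping square patches and flattens each one. It covers only the patching step; the ViT's embedding and transformer layers are not shown, and `split_into_patches` is a hypothetical helper rather than part of any ViT library.

```python
from typing import List


def split_into_patches(image: List[List[float]], patch: int) -> List[List[float]]:
    """Split an H x W image into flattened, non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            patches.append(flat)
    return patches
```

An H x W image yields (H / patch) * (W / patch) patches, so the vision input becomes a sequence that can be aligned with text tokens.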

Language Encoder (Input Processing)

Encode text inputs into hidden representations

Model or implementation: Transformer Encoder (from FLAN-Alpaca-Base)

Feature Fusion

Fuse vision and language representations

Model or implementation: Attention/Interaction Layer
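
The summary names only an "Attention/Interaction Layer", so the sketch below shows one plausible form: each language token attends over the vision patches, and a sigmoid gate blends the attended result back into the language representation. The gated blend, the function names, and the externally supplied gate logits (which would come from learned projections in a real model) are assumptions.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(h_lang: Matrix, h_vis: Matrix) -> Matrix:
    """Language tokens are queries; vision patches are keys and values."""
    d = len(h_lang[0])
    out = []
    for q in h_lang:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in h_vis]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, h_vis)) for j in range(d)])
    return out


def gated_fuse(h_lang: Matrix, h_attn: Matrix, gate_logits: Sequence[float]) -> Matrix:
    """Per token: (1 - g) * language + g * attended vision, with g = sigmoid(logit)."""
    fused = []
    for lang_row, attn_row, logit in zip(h_lang, h_attn, gate_logits):
        g = 1.0 / (1.0 + math.exp(-logit))
        fused.append([(1 - g) * l + g * a for l, a in zip(lang_row, attn_row)])
    return fused
```

With a gate near zero the fused output stays close to the text-only representation, so a gate of this kind lets the model decide per token how much visual evidence to mix in.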

Decoder

Generate text output (Rationale or Answer)

Model or implementation: Transformer Decoder

Novel Architectural Elements

Two-stage pipeline where the exact same architecture is instantiated twice: once for generating rationales and once for inferring answers using those rationales
Integration of frozen ViT features directly into the fine-tuning of a small (<1B) encoder-decoder language model, fused with the encoder's text representations before decoding

Modeling

Base Model: FLAN-Alpaca-Base (200M parameters)

Training Method: Fine-tuning

Objective Functions:

Purpose: Minimize negative log-likelihood of the target sequence (rationale or answer).

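Formally: $\mathcal{L} = -\sum_{i=1}^{N} \log p(Y_i \mid X_{\text{language}}, X_{\text{vision}}, Y_{<i})$, where $N$ is the length of the target sequence $Y$.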
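
As a small worked example of this objective, the sketch below computes the negative log-likelihood of one target sequence from the probability the model assigned to each correct token. The function name `sequence_nll` and the example probabilities are illustrative assumptions; a real training loop would compute this from logits in a deep learning framework.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Sum of -log p(Y_i | X_language, X_vision, Y_<i) over the target tokens."""
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# Hypothetical per-token probabilities for a three-token target.
loss = sequence_nll([0.9, 0.8, 0.95])
```

The loss is zero only when every target token receives probability 1, and it grows as any single token becomes less likely.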

Training Data:

ScienceQA dataset (multimodal MCQs with annotated reasoning chains)

Compute: Deployable on consumer-grade GPUs (e.g., 32 GB of memory) due to the small model size

Comparison to Prior Work

vs. Standard CoT (Zero-shot/Few-shot): Proposed method fine-tunes small models instead of prompting large frozen ones
vs. Caption-based Multimodal CoT: Proposed method fuses dense vision features instead of converting images to text captions
vs. VisualBERT [not cited in paper]: Proposed method focuses specifically on generating explicit reasoning chains (CoT) rather than direct answer classification

Limitations

Reliance on annotated reasoning chains for supervision (requires datasets like ScienceQA)
Two-stage inference increases computational cost compared to direct prediction
Analysis focused on small models (<1B); scalability to larger multimodal models not explicitly tested in this text

Reproducibility

Code: https://github.com/amazon-science/mm-cot

The code is public at the repository above, and the ScienceQA benchmark used in the paper is also public.

📊 Experiments & Results

Evaluation Setup

Multimodal multiple-choice question answering

Benchmarks:

ScienceQA (Multimodal scientific reasoning)
A-OKVQA (Visual Question Answering requiring outside knowledge)

Metrics:

Accuracy (%)
RougeL (for rationale quality)
Statistical methodology: Not explicitly reported in the paper
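
For readers unfamiliar with RougeL, the sketch below scores a generated rationale against a reference using the longest common subsequence (LCS) of their tokens. Lowercased whitespace tokenization and the F1 combination of precision and recall are assumptions; the scoring package and settings used in the paper are not stated here.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based F1 between two strings (assumed tokenization: lowercase split)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the score rewards word overlap in order, a rationale can score highly while still containing a factual error, which is one reason the summary reports answer accuracy alongside it.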

Key Results

Ablation study on ScienceQA using FLAN-Alpaca-Base demonstrates that one-stage CoT degrades performance, while the proposed Multimodal-CoT (two-stage + vision features) significantly improves it. Accuracy and RougeL are on a 0-100 scale; Δ is the absolute difference in points.
Benchmark	Metric	Baseline	This Paper	Δ
ScienceQA	Accuracy	81.63 (text-only, direct answer)	85.31	+3.68
ScienceQA	Accuracy	69.32 (one-stage CoT)	85.31	+15.99
ScienceQA	Accuracy	79.37 (caption-based)	85.31	+5.94
ScienceQA	RougeL	90.73	93.46	+2.73
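
The Δ column is a plain difference in points, which the short check below recomputes from the reported figures. The row labels are shorthand introduced here for readability, and `delta` is a hypothetical helper.

```python
# (label, metric, baseline, this_paper) taken from the rows above.
rows = [
    ("text-only direct answer", "Accuracy", 81.63, 85.31),
    ("one-stage CoT", "Accuracy", 69.32, 85.31),
    ("caption-based", "Accuracy", 79.37, 85.31),
    ("rationale quality", "RougeL", 90.73, 93.46),
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement in points, rounded to two decimals."""
    return round(ours - baseline, 2)


deltas = {label: delta(base, ours) for label, _, base, ours in rows}

# The drop caused by one-stage CoT relative to direct answering.
one_stage_drop = round(69.32 - 81.63, 2)
```

The same arithmetic gives the 12.31-point drop from direct answering to one-stage CoT that motivates the two-stage design.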

Experiment Figures

Pie charts analyzing error types and correction rates.

Main Takeaways

Naive CoT (generating the rationale before the answer in one stage) drastically reduces accuracy in small models, by 12.31 percentage points (81.63% to 69.32%), due to early stopping or hallucination.
Separating rationale generation and answer inference (two-stage) allows the model to leverage rationales effectively, provided the rationales are grounded in vision.
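Using image captions provides only a marginal gain (+0.80 percentage points), apparently measured against the two-stage model without vision features rather than the 81.63% direct-answer baseline; dense vision features are required for significant improvement.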
The method successfully mitigates hallucination: 60.7% of hallucination errors in the text-only baseline were corrected by adding vision features.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Encoder-Decoder)
Chain-of-Thought (CoT) prompting
Vision Transformers (ViT)

Key Terms

CoT: Chain-of-Thought—a technique where models generate intermediate reasoning steps before producing a final answer

Hallucination: In this context, generating reasoning chains that are factually incorrect or disconnected from the provided visual evidence

Rationale: The intermediate natural language explanation generated by the model to support its final answer

ViT: Vision Transformer—a model architecture that processes images as sequences of patches, used here to extract visual features

RougeL: A metric used to evaluate the quality of text generation (specifically the rationale) by measuring the longest common subsequence with reference text

1B-models: Language models with fewer than 1 billion parameters, which typically struggle with CoT reasoning compared to larger models