MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

📝 Paper Summary

Multimodal Evaluation Benchmark · Multimodal Content Comprehension

MM-BigBench evaluates Multimodal Large Language Models (MLLMs) on tasks requiring deep comprehension of text-image relationships (like sarcasm and sentiment), moving beyond simple visual recognition questions.

Core Problem

Existing MLLM benchmarks (e.g., MME, MM-Vet) focus on visual-centric tasks like recognition or spatial reasoning where text is merely a query, neglecting 'multimodal content comprehension' where understanding the text content itself is crucial.

Why it matters:

Real-world applications like social media analysis require detecting nuanced relationships between text and images (e.g., sarcasm, hate speech) which current benchmarks miss
Prior evaluations often assess models or instructions in isolation, ignoring the specific adaptability between model architectures and instruction formats
There is a lack of understanding regarding how MLLMs perform on tasks that require semantic fusion of both modalities rather than just visual grounding

Concrete Example: In Multimodal Sarcasm Recognition, a tweet text might say 'Summer fun day' while the image depicts a gloomy storm. A model trained only on standard VQA might identify 'storm' but fail to detect the sarcasm because it doesn't deeply comprehend the semantic conflict between the text and image content.

Key Novelty

MM-BigBench Evaluation Framework

Shifts evaluation focus from visual-dominant tasks (VQA) to multimodal content comprehension tasks (MSA, Sarcasm, Hate Speech) where text and image carry equal semantic weight
Introduces a multi-dimensional metric suite assessing not just accuracy but also Stability (consistency across instructions) and Adaptability (how well a model pairs with specific prompt styles)
Benchmarks 20 models (including 14 MLLMs) using 10 diverse manually designed instructions per task to analyze sensitivity to prompt engineering

Architecture

A conceptual comparison between traditional Vision-Language tasks (above dotted line) and Multimodal Content Comprehension tasks (below dotted line)

Evaluation Highlights

InstructBLIP achieves the highest total accuracy score (736.72) across all datasets, roughly 99.5 points ahead of the next best model, BLIP-2 (637.21)
Encoder-Decoder models (like Flan-T5-XXL) generally outperform Decoder-only models (like LLaMA series) on these comprehension tasks, with Flan-T5-XXL scoring 618.23 total vs. LLaMA-2-13B's 549.50
Instruction #2 (Question-Answer format) shows the highest adaptability, achieving Top-K performance 340 times across models, compared to just 80 times for Instruction #7

Breakthrough Assessment

7/10

Provides a necessary shift in evaluation focus towards deeper semantic alignment tasks. While it doesn't propose a new model architecture, the comprehensive benchmarking of instruction sensitivity is valuable.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot classification and question answering on multimodal (text + image) datasets

Inputs: An image V, a text context T (e.g., a tweet or caption), and a specific textual instruction I

Outputs: A text response mapped to a label space (e.g., Positive/Negative, Yes/No)

Pipeline Flow

Input Construction (Task Definition + Context + Question + Options + Image)
Prompting (Apply 1 of 10 instruction templates)
Inference (Model generates text response)
Evaluation (Map response to label and calculate metrics)
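
The four stages above can be sketched end to end as follows. This is a minimal illustration, not the benchmark's actual code: the function names, the prompt layout, the word-boundary matching rule, and the stub model are all assumptions made for the example.

```python
import re
from typing import Callable, Optional, Sequence

def build_prompt(task_definition: str, context: str, question: str,
                 options: Sequence[str]) -> str:
    """Assemble the textual input; the image is passed to the model separately."""
    option_str = " ".join(f"({chr(ord('a') + i)}) {opt}" for i, opt in enumerate(options))
    return (f"{task_definition}\nContext: {context}\n"
            f"Question: {question}\nOptions: {option_str}\nAnswer:")

def map_to_label(response: str, options: Sequence[str]) -> Optional[str]:
    """Return the option that appears earliest in the response as a whole word."""
    lowered = response.lower()
    best = None
    for opt in options:
        match = re.search(rf"\b{re.escape(opt.lower())}\b", lowered)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), opt)
    return best[1] if best else None

def evaluate(model: Callable[[object, str], str], examples: Sequence[dict],
             task_definition: str, question: str, options: Sequence[str]) -> float:
    """Run zero-shot inference and return accuracy in percent."""
    correct = 0
    for ex in examples:
        prompt = build_prompt(task_definition, ex["text"], question, options)
        response = model(ex["image"], prompt)
        correct += map_to_label(response, options) == ex["label"]
    return 100.0 * correct / len(examples)

# Stub standing in for a real MLLM (hypothetical): always gives the same answer.
def stub_model(image: object, prompt: str) -> str:
    return "The sentiment is positive."

OPTIONS = ["Positive", "Negative", "Neutral"]
EXAMPLES = [  # toy data, not taken from any benchmark dataset
    {"image": None, "text": "Great day at the beach", "label": "Positive"},
    {"image": None, "text": "Summer fun day", "label": "Negative"},
]
accuracy = evaluate(stub_model, EXAMPLES, "Classify the sentiment of the post.",
                    "What is the sentiment?", OPTIONS)
print(f"accuracy = {accuracy:.2f}")
```

Matching on whole words keeps a short option such as "No" from firing inside "not" or "know"; a real harness would likely need additional rules for option letters and free-form answers.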

System Modules

Instruction Generator

Wraps dataset instances into specific prompt formats (e.g., QA style, conversation style)

Model or implementation: Rule-based templates
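
A rule-based instruction generator of this kind could look like the sketch below. The three templates are hypothetical stand-ins written for illustration; the paper's ten actual templates are listed in its appendix and are not reproduced here.

```python
from typing import Dict, Sequence

# Hypothetical templates; names and wording are assumptions, not the paper's.
TEMPLATES: Dict[str, str] = {
    "qa": "Question: {question} Context: {context} Options: {options} Answer:",
    "conversation": "User: {context}\n{question} Choose from: {options}\nAssistant:",
    "task_first": "{task_definition}\nText: {context}\n{question}\nOptions: {options}\nResponse:",
}

def render_all(task_definition: str, context: str, question: str,
               options: Sequence[str]) -> Dict[str, str]:
    """Wrap one dataset instance in every template."""
    fields = {
        "task_definition": task_definition,
        "context": context,
        "question": question,
        "options": ", ".join(options),
    }
    # str.format ignores unused keyword arguments, so templates may omit fields.
    return {name: tpl.format(**fields) for name, tpl in TEMPLATES.items()}

prompts = render_all(
    task_definition="Decide whether the post is sarcastic.",
    context="Summer fun day",
    question="Is this post sarcastic?",
    options=["Yes", "No"],
)
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Keeping templates as plain data makes it straightforward to run every model against every instruction, which is what the stability and adaptability analyses require.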

Multimodal LM

Processes image and text prompt to generate an answer

Model or implementation: Various (e.g., InstructBLIP, LLaVA, LLaMA-Adapter)

Novel Architectural Elements

Introduction of 'Adaptability' metric: Quantifies how frequently a specific instruction yields Top-K performance for a specific model
Introduction of 'Stability' metric: Measures the standard deviation of accuracy across different instructions (Model Stability) or across different models (Instruction Stability)
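
As a rough sketch of how these two metrics can be computed from a model-by-instruction accuracy table: the accuracy values below are made up, the use of the population standard deviation is an assumption (this summary does not say which variant the paper uses), and the function names are illustrative.

```python
import statistics
from typing import Dict, List

# Hypothetical accuracies: one list per model, one entry per instruction.
ACC: Dict[str, List[float]] = {
    "model_a": [60.0, 70.0, 65.0],
    "model_b": [50.0, 55.0, 40.0],
}

def model_stability(acc: Dict[str, List[float]]) -> Dict[str, float]:
    """Std of each model's accuracy across instructions (lower = more stable)."""
    return {model: statistics.pstdev(scores) for model, scores in acc.items()}

def instruction_stability(acc: Dict[str, List[float]]) -> List[float]:
    """Std of each instruction's accuracy across models."""
    n_instructions = len(next(iter(acc.values())))
    return [statistics.pstdev([scores[i] for scores in acc.values()])
            for i in range(n_instructions)]

def top_k_hits(acc: Dict[str, List[float]], k: int) -> List[int]:
    """Count, per instruction, how many models rank it among their top-k."""
    n_instructions = len(next(iter(acc.values())))
    hits = [0] * n_instructions
    for scores in acc.values():
        ranked = sorted(range(n_instructions), key=lambda i: scores[i], reverse=True)
        for i in ranked[:k]:
            hits[i] += 1
    return hits

print("model stability:", model_stability(ACC))
print("instruction stability:", instruction_stability(ACC))
print("top-1 hits per instruction:", top_k_hits(ACC, k=1))
```

Dividing the hit counts by the number of model and dataset combinations would turn them into a ratio; whether the paper reports counts or ratios is not assumed here.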

Modeling

Base Model: Various (Evaluates 20 models including ChatGPT, LLaMA-1/2, Flan-T5, LLaVA, BLIP-2, InstructBLIP)

Training Method: Zero-shot evaluation (no training performed in this paper)

Adaptation: None (Pre-trained models used as-is)

Trainable Parameters: 0 (Inference only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MME/MM-Vet: MM-BigBench focuses on 'content comprehension' (sentiment, sarcasm) where text is content-bearing, not just a query
vs. Standard NLP Benchmarks: MM-BigBench incorporates the visual modality, unlike pure text evaluations (e.g., GLUE)
Novelty: First comprehensive assessment of instruction adaptability for MLLMs on these specific semantic tasks

Limitations

Evaluates only zero-shot performance, neglecting few-shot or fine-tuning capabilities
Focuses on classification/QA accuracy, potentially missing nuances in generative reasoning quality
Some models (e.g., LaVIN) have unfair advantages due to training on specific datasets (ScienceQA) included in the benchmark

Reproducibility

Code: https://github.com/declare-lab/MM-BigBench

The paper evaluates public models and uses public datasets (MVSA, Twitter-15/17, Hate, Sarcasm, etc.). Instruction templates are fully listed in the appendix.

📊 Experiments & Results

Evaluation Setup

Zero-shot inference on 14 datasets across 6 tasks using 10 instruction templates per dataset

Benchmarks:

MVSA-Single/Multiple: Multimodal Sentiment Analysis (MSA)
Twitter-2015/2017: Multimodal Aspect-Based Sentiment Analysis (MABSA)
Hate Memes: Multimodal Hateful Memes Recognition (MHMR)
Sarcasm: Multimodal Sarcasm Recognition (MSR)
MNRE: Multimodal Relation Extraction (MRE)
ScienceQA: Visual Question Answering (VQA)

Metrics:

Accuracy (Acc)
Best Performance (Upper bound across instructions)
Mean Relative Gain (MRG)
Stability (Standard Deviation)
Adaptability (Top-K Hit Ratio)

Statistical methodology: Not explicitly reported in the paper
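
Best Performance and Mean Relative Gain from the metric list can be sketched as below. The MRG formula used here (each model's percentage gain over the cross-model mean for an instruction, averaged over instructions) is an assumed formulation, since this summary does not quote the paper's exact definition; the accuracy values are made up.

```python
from typing import Dict, List

# Hypothetical accuracies: one list per model, one entry per instruction.
SCORES: Dict[str, List[float]] = {
    "model_a": [60.0, 70.0, 65.0],
    "model_b": [50.0, 55.0, 40.0],
}

def best_performance(acc: Dict[str, List[float]]) -> Dict[str, float]:
    """Upper bound per model: its best accuracy over all instructions."""
    return {model: max(scores) for model, scores in acc.items()}

def mean_relative_gain(acc: Dict[str, List[float]]) -> Dict[str, float]:
    """Assumed MRG: mean over instructions of the % gain vs. the cross-model mean."""
    models = list(acc)
    n_instructions = len(acc[models[0]])
    # Average accuracy of all models on each instruction.
    column_mean = [sum(acc[m][i] for m in models) / len(models)
                   for i in range(n_instructions)]
    return {
        m: sum((acc[m][i] - column_mean[i]) / column_mean[i] * 100.0
               for i in range(n_instructions)) / n_instructions
        for m in models
    }

print("best performance:", best_performance(SCORES))
print("mean relative gain:", mean_relative_gain(SCORES))
```

Under this formulation the gains of all models sum to zero for every instruction, so a positive MRG means a model is above the field on average and a negative one means it is below.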

Key Results

Overall performance comparison showing the dominance of InstructBLIP and the strong performance of pure text models (Flan-T5) over some MLLMs.

Benchmark	Metric	Baseline	This Paper	Δ
Total Score (Sum of Acc across all datasets)	Total Accuracy	549.50 (LLaMA-2-13B)	736.72 (InstructBLIP)	+187.22
Total Score (Sum of Acc across all datasets)	Total Accuracy	359.39	637.21 (BLIP-2)	+277.82
Total Score (Sum of Acc across all datasets)	Total Accuracy	549.50 (LLaMA-2-13B)	618.23 (Flan-T5-XXL)	+68.73

Adaptability analysis reveals which instruction formats perform best across all models.

Benchmark	Metric	Baseline	This Paper	Δ
Cross-Model Average	Top-K Hit Ratio (Adaptability)	109.51	340.48 (Instruction #2)	+230.97
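
The Δ column appears to be the plain difference between the "This Paper" and "Baseline" values in absolute points. A quick check of that reading, using only the numbers in the table (the variable names are illustrative):

```python
# (metric, baseline, this_paper) copied from the Key Results table.
ROWS = [
    ("Total Accuracy", 549.50, 736.72),
    ("Total Accuracy", 359.39, 637.21),
    ("Total Accuracy", 549.50, 618.23),
    ("Top-K Hit Ratio (Adaptability)", 109.51, 340.48),
]

# Absolute difference, rounded to the table's two decimal places.
DELTAS = [round(this_paper - baseline, 2) for _, baseline, this_paper in ROWS]

for (metric, baseline, this_paper), delta in zip(ROWS, DELTAS):
    print(f"{metric}: {this_paper:.2f} - {baseline:.2f} = {delta:+.2f}")
```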

Experiment Figures

The 10 different instruction templates used for the ScienceQA task, demonstrating the variations in prompting strategies (e.g., simple QA, conversation, adding options)

Main Takeaways

Encoder-Decoder models (like those using Flan-T5) consistently outperform Decoder-only models (like LLaMA-based ones) on multimodal comprehension tasks
Instruction tuning significantly improves model stability; InstructBLIP is more stable across varying prompts than BLIP-2
Simple 'Question-Answer' formatted instructions (Instruction #2) yield the best performance across the majority of models compared to conversational or complex instruction formats
Pure text models like Flan-T5-XXL can outperform many dedicated MLLMs on these tasks, suggesting some MLLMs may not be effectively leveraging visual information for semantic comprehension
Larger model sizes consistently correlate with better performance within the same model family (e.g., LLaVA-13B > LLaVA-7B)

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with standard NLP tasks (Sentiment Analysis, Named Entity Recognition)
Knowledge of Zero-shot evaluation settings

Key Terms

MLLM: Multimodal Large Language Model—AI models capable of processing and reasoning over both text and image inputs

MSA: Multimodal Sentiment Analysis—detecting sentiment (positive/negative/neutral) from text-image pairs

MABSA: Multimodal Aspect-Based Sentiment Analysis—identifying sentiment toward specific aspects/entities within multimodal content

MSR: Multimodal Sarcasm Recognition—detecting sarcasm that often arises from the contradiction between text and image

MHMR: Multimodal Hateful Memes Recognition—identifying hate speech in memes where meaning depends on text-image context

VQA: Visual Question Answering—answering questions based on visual content

MRE: Multimodal Relation Extraction—identifying relationships between entities in a text-image pair

Encoder-Decoder: A neural architecture (like T5) that encodes input into a representation before decoding it, often used for sequence-to-sequence tasks

Decoder-only: A neural architecture (like GPT or LLaMA) that predicts the next token based on history, common in generative LLMs