Are We on the Right Way for Evaluating Large Vision-Language Models?

📝 Paper Summary

Vision-Language Models (LVLMs) Model Evaluation

The authors identify severe data leakage and visual independence issues in existing LVLM benchmarks and propose MMStar, a manually curated benchmark requiring true visual understanding.

Core Problem

Existing LVLM benchmarks contain many samples where visual content is unnecessary (answerable by text alone) and suffer from unintentional data leakage where models memorize answers during training.

Why it matters:

Evaluation samples that don't require vision degrade LVLM assessment into merely testing the text-only LLM backbone
Unintentional leakage leads to inflated scores and unfair comparisons, misguiding research on actual multi-modal architectural gains
High scores on flawed benchmarks create a false sense of progress while models may lack genuine multi-modal reasoning capabilities

Concrete Example: GeminiPro achieves 42.9% on the MMMU benchmark without seeing any images. Sphinx-X-MoE reaches 43.6% on MMMU without images, surpassing its own LLM backbone (25.7%) by 17.9 percentage points, a gap the authors attribute to memorization.

Key Novelty

MMStar Benchmark & Metric Suite

Developed an automated pipeline using 8 LLMs to filter out samples answerable without images, followed by strict human curation to ensure visual dependency
Introduced 'Multi-modal Gain' (MG) and 'Multi-modal Leakage' (ML) metrics to quantify how much performance comes from actual visual understanding versus training data memorization

Architecture

The data curation pipeline for constructing MMStar.

Evaluation Highlights

GeminiPro outperforms random choice by over 24% on average across six existing benchmarks without accessing any visual input
GPT-4V (high-res) achieves the highest accuracy on MMStar among the evaluated LVLMs, at only 57.1%, leaving substantial headroom even for the strongest model tested
On the MMMU benchmark, Sphinx-X-MoE achieves 43.6% accuracy without images, indicating severe data leakage in its training

Breakthrough Assessment

9/10

Crucial reality check for the field. Exposes fundamental flaws in current SOTA benchmarks and provides a cleaned dataset plus metrics to measure leakage, likely shifting future evaluation standards.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of Large Vision-Language Models (LVLMs) on multi-modal tasks

Inputs: Image I and textual Question Q (and Options O)

Outputs: Predicted Answer A

Pipeline Flow

Automated Coarse Filtering (LLM Inspectors)
Manual Review & Curation
Evaluation with New Metrics

System Modules

Automated Filter (Data Curation)

Identify samples answerable without vision

Model or implementation: Ensemble of 8 LLMs (GPT4-Turbo, GeminiPro, LLaMA-70B, etc.)
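
The coarse filtering step can be illustrated with a minimal sketch. It assumes each inspector is a callable that sees only the question and options (never the image) and returns an option letter. The `Sample` class, the stub inspectors, and the `max_hits` threshold are hypothetical names and values chosen for illustration; the summary does not state the threshold the authors used.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical container for one multiple-choice benchmark item."""
    question: str
    options: Tuple[str, ...]
    answer: str  # ground-truth option letter, e.g. "B"


# An inspector receives the question and options only (no image) and returns a letter.
Inspector = Callable[[str, Sequence[str]], str]


def hit_count(sample: Sample, inspectors: Sequence[Inspector]) -> int:
    """Count how many text-only inspectors select the ground-truth option."""
    return sum(
        1 for ask in inspectors
        if ask(sample.question, sample.options) == sample.answer
    )


def coarse_filter(samples: Sequence[Sample],
                  inspectors: Sequence[Inspector],
                  max_hits: int) -> List[Sample]:
    """Keep samples that at most `max_hits` inspectors answer without the image.

    `max_hits` is an illustrative parameter, not a value taken from the summary.
    """
    return [s for s in samples if hit_count(s, inspectors) <= max_hits]


# Stub inspectors standing in for real LLM calls (assumption for the demo).
def always_a(question, options):
    return "A"


def always_b(question, options):
    return "B"


samples = [
    Sample("What is the capital of France?", ("Paris", "Rome"), "A"),
    Sample("What colour is the car in the image?", ("Red", "Blue"), "B"),
]
kept = coarse_filter(samples, [always_a, always_a, always_b], max_hits=1)
print([s.question for s in kept])
```

In this toy run the first question is hit by two of the three stub inspectors and is dropped, while the second survives and would be passed on to manual review.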

Manual Reviewer (Data Curation)

Verify visual dependency and categorization

Model or implementation: Human Experts

Metric Calculator

Compute MG and ML metrics

Model or implementation: Mathematical Formulation
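
A minimal sketch of the two metrics, following the definitions given in this summary: MG is the LVLM's score with images minus its score without them, and ML is the LVLM's score without images minus its LLM backbone's score. The function names are hypothetical, and clipping ML at zero is an assumption of this sketch, not something the summary states.

```python
def multimodal_gain(score_with_image: float, score_without_image: float) -> float:
    """MG: LVLM accuracy with images minus the same LVLM's accuracy without them."""
    return score_with_image - score_without_image


def multimodal_leakage(score_without_image: float,
                       backbone_score: float,
                       clip_at_zero: bool = True) -> float:
    """ML: LVLM text-only accuracy minus its LLM backbone's accuracy.

    Clipping negative differences to zero is an assumption of this sketch.
    """
    diff = score_without_image - backbone_score
    return max(0.0, diff) if clip_at_zero else diff


# Figures quoted in this summary: Sphinx-X-MoE on MMMU without images (43.6)
# versus its LLM backbone (25.7).
ml = multimodal_leakage(score_without_image=43.6, backbone_score=25.7)
print(round(ml, 1))  # 17.9
```

A high MG with a low ML is the desirable pattern: the model benefits from seeing the image and shows little sign of having seen the evaluation samples during multi-modal training.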

Novel Architectural Elements

Two-stage filtration pipeline utilizing an ensemble of 'LLM Inspectors' to detect visual-independent samples
Definition of Multi-modal Gain (MG) and Multi-modal Leakage (ML) as standardized benchmark metrics

Modeling

Base Model: Evaluates 16 LVLMs (e.g., GPT-4V, GeminiPro-Vision, LLaVA series) and their LLM backbones

Compute: NVIDIA A100 GPUs used for non-API-based evaluation

Comparison to Prior Work

vs. MMBench/MMMU: MMStar filters out visual-independent questions (where text alone suffices), whereas MMBench/MMMU contain significant portions (20%+) of such questions
vs. ScienceQA: MMStar ensures visual dependency, whereas more than 50% of ScienceQA questions are solvable by text-only LLMs
Novelty: First benchmark to explicitly quantify and subtract 'leakage' and 'text-only' priors from multi-modal performance scores

Limitations

The filtering step relies on the capabilities of the specific LLM inspectors used; stronger future LLMs might solve even 'visual-dependent' questions via world knowledge
Manual curation is labor-intensive and may not scale to extremely large datasets
The benchmark focuses on 6 core capabilities, which may not cover all emerging multi-modal tasks

Reproducibility

Code: https://github.com/MMStar-Benchmark/MMStar

The benchmark dataset and evaluation code are publicly available at this repository. Weights for the open-source models evaluated are available from their respective repositories.

📊 Experiments & Results

Evaluation Setup

Multi-choice question answering across diverse domains

Benchmarks:

MMStar (Visual-indispensable multi-modal QA) [New]
MMBench (General multi-modal QA)
MMMU (Multi-discipline expert QA)
ScienceQA (Scientific QA)
MathVista (Mathematical reasoning)
AI2D (Diagram understanding)
SEED (General multi-modal QA)

Metrics:

Accuracy
Multi-modal Gain (MG)
Multi-modal Leakage (ML)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on the new MMStar benchmark highlights the difficulty of truly vision-dependent tasks.
MMStar	Accuracy	41.8	57.1	+15.3
MMStar	Accuracy	51.4	57.1	+5.7
Investigation of visual independency (answering without images) on existing benchmarks.
ScienceQA	Abnormal Hit Rate	0	57.2	+57.2
MMMU	Accuracy (Text-only)	24.8	43.6	+18.8
Data Leakage analysis showing LVLMs memorizing training data.
MMMU	Accuracy (Text-only)	25.7	43.6	+17.9

Experiment Figures

Abnormal hit rates across 6 benchmarks, showing the percentage of questions answerable by text-only LLMs.
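
One way to compute such a rate from per-sample hit counts is sketched below. It assumes a sample counts as "answerable by text-only LLMs" when at least `min_hits` of the inspectors answer it correctly without the image. The function name and the threshold are hypothetical; the summary does not give the exact cut-off.

```python
from typing import Sequence


def abnormal_hit_rate(hit_counts: Sequence[int], min_hits: int) -> float:
    """Percentage of samples answered correctly by at least `min_hits` text-only LLMs.

    `min_hits` is an illustrative threshold, not a value taken from the summary.
    """
    if not hit_counts:
        raise ValueError("hit_counts must not be empty")
    abnormal = sum(1 for hits in hit_counts if hits >= min_hits)
    return 100.0 * abnormal / len(hit_counts)


# Hypothetical hit counts for five samples, each probed by 8 text-only LLMs.
rate = abnormal_hit_rate([8, 7, 1, 0, 6], min_hits=6)
print(rate)  # 60.0
```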

Radar chart and distribution of the 6 core capabilities and 18 detailed axes in MMStar.

Main Takeaways

Many existing benchmarks (ScienceQA, AI2D) have high rates of questions answerable by text alone, failing to test visual capabilities.
Significant data leakage exists: LVLMs often outperform their base LLMs on text-only versions of multi-modal tasks, which suggests they memorized the samples during multi-modal training.
MMStar is significantly harder than previous benchmarks; even GPT-4V only achieves 57.1% accuracy.
Fine-grained perception and logical reasoning remain major challenges for current SOTA LVLMs.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Vision-Language Models (LVLMs) architecture
Familiarity with standard multi-modal benchmarks (MMBench, MMMU, ScienceQA)
Concept of data leakage in machine learning evaluation

Key Terms

LVLM: Large Vision-Language Model—a model combining a visual encoder and an LLM to process image-text inputs

Data Leakage: When evaluation data is inadvertently included in the model's training set, allowing it to memorize answers rather than reason

Visual Dependency: The requirement that a question cannot be correctly answered without processing the accompanying visual content

MG: Multi-modal Gain—a metric measuring the performance improvement of an LVLM when visual input is provided versus text-only input

ML: Multi-modal Leakage—a metric measuring how much better the LVLM's text-only performance is compared to its base LLM, indicating memorization

LLM Inspector: A text-only Large Language Model used to attempt multi-modal questions without the image; if it answers correctly, the visual content is deemed unnecessary for that question