MMBench: Is Your Multi-modal Model an All-around Player?

📝 Paper Summary

Multimodal Evaluation Benchmark, Vision-Language Models (VLMs)

MMBench is a comprehensive bilingual benchmark with over 3,000 multiple-choice questions that uses a circular evaluation strategy and LLM-assisted (GPT-4) answer extraction to robustly assess vision-language models.

Core Problem

Existing objective VLM benchmarks suffer from false negatives caused by exact-match requirements and lack fine-grained ability analysis, while subjective human evaluations are non-scalable and biased.

Why it matters:

Traditional metrics (e.g., VQAv2 accuracy) penalize correct answers phrased differently (e.g., 'bicycle' vs 'bike'), obscuring true model capability.
Subjective evaluations (e.g., OwlEval) are expensive and hard to reproduce, while objective benchmarks often mis-score models whose weak instruction following produces off-format answers.
Lack of standardized bilingual benchmarks prevents fair apples-to-apples comparison of VLMs in English and Chinese contexts.

Concrete Example: In VQA, if the reference answer is 'bike' but a model predicts 'bicycle', exact-match metrics mark the prediction as wrong. Similarly, a model with poor instruction following might output 'the meaning of choice A' instead of just 'A', causing rule-based matching to fail even if the reasoning is correct.
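The false negative in this example can be reproduced with a few lines of Python. This is a minimal sketch of a simplified exact-match scorer, not the official VQAv2 metric; the function name `exact_match` and the sample predictions are illustrative assumptions.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict string comparison after trivial normalization (case, whitespace)."""
    return prediction.strip().lower() == reference.strip().lower()


reference = "bike"
predictions = ["bike", "bicycle", "A bike"]  # illustrative model outputs

# Only the literal string "bike" is accepted; the synonym and the
# slightly longer phrasing are both scored as wrong.
results = {p: exact_match(p, reference) for p in predictions}
for prediction, ok in results.items():
    print(f"{prediction!r}: {'correct' if ok else 'wrong'}")
```

Two of the three predictions are semantically right but scored as wrong, which is the failure mode that multiple-choice questions plus LLM-based extraction are meant to avoid.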

Key Novelty

MMBench: Robust Bilingual Circular Evaluation

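Introduces CircularEval: Feeds the same multiple-choice question to the VLM N times (N = number of choices), circularly shifting the choices on each pass, and counts the question as solved only if every pass is answered correctly. This checks that the model actually knows the answer rather than guessing or exploiting position bias.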
Uses LLM-based Choice Extraction: Instead of rigid rule-based matching, employs GPT-4 to map free-form VLM predictions to specific multiple-choice options, salvaging correct answers from models with weak instruction-following.
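As an illustration of how the reordered passes in CircularEval might be generated, the sketch below rotates the option list once per pass and tracks where the correct answer lands. It assumes the reordering is a circular shift, which the name suggests and which matches the N passes per question noted under Limitations. The helper name `circular_variants` is hypothetical and not taken from VLMEvalKit.

```python
from string import ascii_uppercase


def circular_variants(choices: list[str], answer_index: int):
    """Yield (rotated_choices, answer_label) for each of the N rotations."""
    n = len(choices)
    for shift in range(n):
        # Rotate left by `shift`: the option at index i moves to (i - shift) % n.
        rotated = choices[shift:] + choices[:shift]
        new_index = (answer_index - shift) % n
        yield rotated, ascii_uppercase[new_index]


choices = ["cat", "dog", "bird", "fish"]
variants = list(circular_variants(choices, answer_index=1))  # "dog" is correct

for rotated, label in variants:
    print(label, rotated)
```

Each pass presents the same content under a different label, so a model that always answers "A" or "B" cannot pass all N rotations.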

Architecture

The pipeline for data construction and evaluation strategy, specifically highlighting the filtering process and the choice extraction flow.

Evaluation Highlights

GPT-4-based choice matching aligns with human assessment in 91.5% of cases, significantly reducing false negatives compared to traditional exact matching.
Constructed a dataset of 3,217 questions covering 20 fine-grained abilities (e.g., object localization, social reasoning) with rigorous quality control involving LLM voting and manual verification.
Evaluates 21 major vision-language models, revealing that proprietary models (like GPT-4v) generally outperform open-source ones, though instruction-following varies significantly.

Breakthrough Assessment

8/10

Significantly improves VLM evaluation robustness by addressing the 'exact match' problem and position bias. The integration of CircularEval and LLM-based extraction is a practical methodological advance for the field.

⚙️ Technical Details

Problem Definition

Setting: Multiple-choice Visual Question Answering (VQA) evaluation across diverse ability dimensions

Inputs: Image I, Question Q, and a set of Choices C

Outputs: Predicted Choice Label (e.g., A, B, C, or D)
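One way to represent this setting in code is a small record type plus a prompt builder. This is a sketch only: the class `MCQSample`, the field names, the file name, and the prompt wording are all assumptions for illustration, not the format used by MMBench or VLMEvalKit.

```python
from dataclasses import dataclass
from string import ascii_uppercase


@dataclass
class MCQSample:
    image_path: str     # stands in for the image I
    question: str       # Q
    choices: list[str]  # C
    answer: str         # gold choice label, e.g. "B"


def build_prompt(sample: MCQSample) -> str:
    """Format the question and lettered options as the text part of the input."""
    lines = [sample.question]
    for label, choice in zip(ascii_uppercase, sample.choices):
        lines.append(f"{label}. {choice}")
    # Hypothetical instruction; real prompts differ between models.
    lines.append("Answer with the option letter only.")
    return "\n".join(lines)


sample = MCQSample("img_0001.jpg", "What animal is shown?", ["cat", "dog"], "B")
prompt = build_prompt(sample)
print(prompt)
```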

Pipeline Flow

Question Curation & Quality Control
Inference with CircularEval
Answer Extraction (Heuristic + LLM)
Scoring

System Modules

Data Curator

Selects and verifies questions

Model or implementation: Human volunteers + LLMs (GPT-4, Gemini-Pro)

VLM Inference Engine (Evaluation Pipeline)

Generates answers for questions

Model or implementation: Target VLM being evaluated (e.g., LLaVA, GPT-4v)

Choice Extractor (Evaluation Pipeline)

Parses free-form prediction into a choice label

Model or implementation: Rule-based matcher with fallback to GPT-4

Novel Architectural Elements

CircularEval strategy: Runs inference multiple times per question, each time with a different ordering of the choices, and requires consistent correct answers across passes
Two-stage extraction pipeline: Cascading from heuristic matching to LLM-based semantic matching for robust output parsing
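The two-stage extraction cascade could look roughly like the sketch below. The matching rules are assumptions chosen for illustration, and `llm_extract` is a caller-supplied callable standing in for a GPT-4 request; no real API call is made here, and `fake_llm` is a stub used only for the demonstration.

```python
import re
from typing import Callable, Optional


def heuristic_extract(prediction: str, choices: dict[str, str]) -> Optional[str]:
    """Stage 1: cheap rules. Return a label only when the match is unambiguous."""
    text = prediction.strip()
    # Rule 1: the output is a bare label such as "B", "B." or "(B)".
    m = re.fullmatch(r"\(?([A-Z])\)?[.:]?", text)
    if m and m.group(1) in choices:
        return m.group(1)
    # Rule 2: exactly one option's text appears in the output.
    hits = [label for label, content in choices.items()
            if content.lower() in text.lower()]
    return hits[0] if len(hits) == 1 else None


def extract_choice(prediction: str, choices: dict[str, str],
                   llm_extract: Callable[[str, dict[str, str]], str]) -> str:
    """Stage 2: fall back to the LLM only when the heuristics give up."""
    label = heuristic_extract(prediction, choices)
    return label if label is not None else llm_extract(prediction, choices)


choices = {"A": "cat", "B": "dog", "C": "bird"}
llm_calls = []


def fake_llm(prediction, choices):
    # Stub: records the call and returns a fixed label.
    llm_calls.append(prediction)
    return "C"


print(extract_choice("(B)", choices, fake_llm))                      # heuristic
print(extract_choice("The picture shows a dog.", choices, fake_llm)) # heuristic
print(extract_choice("Probably the feathered one", choices, fake_llm))  # LLM
```

Ordering the stages this way keeps LLM calls, and therefore cost, limited to the outputs that simple rules cannot parse.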

Comparison to Prior Work

vs. VQAv2: Uses multiple choice + LLM extraction to avoid false negatives from synonym mismatches
vs. OwlEval: Fully objective and scalable (automated) rather than relying on human annotators
vs. MME: Larger scale (3,000+ questions) and a hierarchical ability taxonomy
vs. SEED-Bench [not cited in paper]: MMBench incorporates CircularEval to mitigate position bias, whereas SEED-Bench typically uses single-pass evaluation

Limitations

Relies on GPT-4 for answer extraction, introducing potential bias or cost dependencies
Held-out test set prevents full independent analysis of the hardest samples
CircularEval increases inference cost linearly with the number of choices (N times per question)
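To make the cost point concrete, the sketch below counts VLM calls for single-pass evaluation versus an upper bound for CircularEval. The function names are hypothetical, and the four-options-per-question figure in the last line is an assumption for illustration; the summary does not state how many options each question has.

```python
def vanilla_passes(choice_counts: list[int]) -> int:
    """Single-pass evaluation: one VLM call per question."""
    return len(choice_counts)


def circular_eval_passes(choice_counts: list[int]) -> int:
    """Upper bound for CircularEval: N calls for a question with N choices."""
    return sum(choice_counts)


# Toy benchmark with four questions that have 4, 4, 3 and 2 options.
counts = [4, 4, 3, 2]
print(vanilla_passes(counts), circular_eval_passes(counts))

# ASSUMPTION: if all 3,217 questions had exactly 4 options,
# the upper bound would be 3217 * 4 calls.
upper_bound = circular_eval_passes([4] * 3217)
print(upper_bound)
```

This is an upper bound because an implementation that stops at the first failed pass for a question would issue fewer calls.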

Reproducibility

Code: https://github.com/open-compass/VLMEvalKit

Publicly available: Code in VLMEvalKit, Dev split (approx. 40% of data).
Missing: Test split labels (held out for server-side evaluation).
Closed-source dependencies: Relies on the OpenAI GPT-4 API for the extraction module and data construction validation.

📊 Experiments & Results

Evaluation Setup

Bilingual (English/Chinese) Multiple-Choice Question Answering

Benchmarks:

MMBench (Multimodal capability assessment across 20 skills) [New]

Metrics:

Accuracy (Top-1)
CircularEval Success Rate
Statistical methodology: Not explicitly reported in the paper
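The difference between the two metrics can be shown on toy data. In this sketch, single-pass accuracy is taken from the first pass of each question (assumed to be the original option order), while the CircularEval rate requires every pass to be correct. The input layout, a list of `(question_id, is_correct)` tuples, is an assumption made for illustration.

```python
from collections import defaultdict


def score(pass_results):
    """Return (single_pass_accuracy, circular_eval_rate).

    pass_results: iterable of (question_id, is_correct), one tuple per pass.
    ASSUMPTION: the first tuple seen for a question is the original ordering.
    """
    per_question = defaultdict(list)
    for qid, ok in pass_results:
        per_question[qid].append(ok)
    n = len(per_question)
    single_pass = sum(passes[0] for passes in per_question.values()) / n
    circular = sum(all(passes) for passes in per_question.values()) / n
    return single_pass, circular


results = [
    ("q1", True), ("q1", True), ("q1", True), ("q1", True),
    ("q2", True), ("q2", False), ("q2", True), ("q2", True),
]
single_pass_acc, circular_rate = score(results)
print(single_pass_acc, circular_rate)
```

Here q2 is answered correctly in the original order but fails one rotation, so it counts toward single-pass accuracy and not toward the CircularEval rate.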

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Validation of the evaluation methodology itself, specifically the effectiveness of GPT-4 as an answer extractor compared to human judgment.
MMBench (Subset)	Alignment with Human Assessment (%)	100	91.5	-8.5
Evaluation of model instruction-following capabilities using heuristic matching rates.
MMBench	Heuristic Matching Success Rate (%)	100	Not reported in the paper	Not reported in the paper
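An alignment figure of this kind is presumably the share of samples on which the extractor's label equals the human-assigned label. The sketch below computes that share on made-up labels; the data is purely illustrative and is not taken from the paper.

```python
def alignment_rate(extracted: list[str], human: list[str]) -> float:
    """Fraction of samples where the extractor and the human chose the same label."""
    if len(extracted) != len(human) or not human:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(e == h for e, h in zip(extracted, human))
    return matches / len(human)


# Illustrative labels only.
llm_labels = ["A", "B", "C", "D", "A"]
human_labels = ["A", "B", "C", "D", "B"]
rate = alignment_rate(llm_labels, human_labels)
print(f"{rate:.1%}")
```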

Experiment Figures

Sunburst chart displaying the hierarchical taxonomy of abilities in MMBench (L-1, L-2, L-3) and question counts.

Main Takeaways

Proprietary models (e.g., GPT-4v) generally outperform open-source models on MMBench.
CircularEval reveals that many models rely on position bias; performance drops when required to be consistent across shuffled choices.
The rigorous data construction process (filtering text-only solvable questions) ensures the benchmark actually tests multimodal capabilities, not just language priors.
GPT-4 serves as a robust proxy for human evaluation in extracting choices from verbose or poorly formatted model outputs.
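The text-only filtering step could be sketched as a vote among language models that see the question without the image. Everything here is hypothetical: the judges are plain callables standing in for LLM requests, and the `min_votes=2` threshold is an assumption, since the summary does not give the actual rule. Flagged questions would then go to manual verification.

```python
from typing import Callable, Sequence

# (question, choices) -> predicted label; a stand-in for a text-only LLM call.
Judge = Callable[[str, Sequence[str]], str]


def text_only_votes(question: str, choices: Sequence[str], answer: str,
                    judges: Sequence[Judge]) -> int:
    """Count judges that pick the gold label without seeing the image."""
    return sum(1 for judge in judges if judge(question, choices) == answer)


def flag_for_review(question: str, choices: Sequence[str], answer: str,
                    judges: Sequence[Judge], min_votes: int = 2) -> bool:
    """Flag a question as possibly solvable from text alone (threshold assumed)."""
    return text_only_votes(question, choices, answer, judges) >= min_votes


# Stub judges with fixed answers, for demonstration only.
judges = [lambda q, c: "A", lambda q, c: "A", lambda q, c: "B"]
question = "What color is a ripe banana?"
options = ["yellow", "blue"]

flagged = flag_for_review(question, options, "A", judges)
not_flagged = flag_for_review(question, options, "B", judges)
print(flagged, not_flagged)
```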

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) and VQA tasks
Familiarity with Large Language Models (LLMs) and prompting
Basic knowledge of evaluation metrics (Accuracy, Exact Match)

Key Terms

CircularEval: An evaluation strategy where the same question is asked multiple times with shuffled answer choices to test consistency and reduce guessing

L-3 Ability: The most fine-grained level in MMBench's taxonomy (Level-3), representing specific skills like 'Object Localization' or 'Social Reasoning'

Instruction Following: The ability of a model to adhere to formatting constraints in a prompt (e.g., 'Output only the option letter')

False Negative: When a model's correct answer is marked wrong because the evaluation metric cannot recognize it (e.g., synonym mismatch)

LLM-based Choice Extraction: Using a strong LLM (like GPT-4) to interpret a VLM's free-text output and map it to a specific multiple-choice option