
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang
National University of Singapore, Microsoft Cloud + AI
arXiv (2023)
Tags: MM, Benchmark, Reasoning, QA

📝 Paper Summary

Topics: Multimodal Evaluation, Benchmark, Large Multimodal Models (LMMs)
MM-Vet is a benchmark that evaluates large multimodal models on complex tasks by defining six core vision-language capabilities and examining their integrations, using an LLM-based evaluator to score open-ended responses.
Core Problem
Existing vision-language benchmarks focus on isolated capabilities (like recognition or OCR) and simple tasks, failing to evaluate how Large Multimodal Models (LMMs) integrate multiple skills to solve complex, real-world problems.
Why it matters:
  • Current benchmarks cannot assess the 'generalist' nature of LMMs, which solve tasks that require reasoning, recognition, and spatial awareness simultaneously.
  • Rapid model advancements (like GPT-4V) require evaluation metrics that handle diverse, open-ended answer styles beyond simple multiple-choice or binary classification.
  • A single overall ranking hides model-level insights; developers need to know *which* specific capability integrations (e.g., OCR + Math) cause failures.
Concrete Example: A question asks, 'What will the girl on the right write on the board?' To answer, a model must recognize the people (Recognition), locate the girl on the right (Spatial Awareness), read the text on the board (OCR), and solve the equation (Math). Existing benchmarks test these skills separately and therefore miss failures at their integration points.
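To make the capability annotations concrete, below is a minimal sketch of how a sample and its required capabilities could be represented; the field names, path, and capability tags are illustrative placeholders, not MM-Vet's actual data schema.

```python
# Illustrative sketch only: field names and tags are hypothetical,
# not MM-Vet's actual annotation format.
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    image_path: str
    question: str
    capabilities: frozenset[str]  # subset of the six core capabilities

whiteboard_example = Sample(
    image_path="images/whiteboard.jpg",  # hypothetical path
    question="What will the girl on the right write on the board?",
    capabilities=frozenset({"recognition", "spatial awareness", "ocr", "math"}),
)

# Grouping samples by their capability set yields per-integration scores,
# which is how the benchmark exposes which combinations a model fails on.
```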
Key Novelty
Capability Integration Evaluation
  • Defines 6 core Vision-Language (VL) capabilities (Recognition, Knowledge, OCR, Spatial, Generation, Math) and explicitly evaluates the 16 different combinations (integrations) of these skills.
  • Uses an LLM-based evaluator (GPT-4) with soft scoring (0.0 to 1.0) to grade open-ended outputs, yielding a unified metric across diverse question types and answer styles.
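As a rough illustration of the LLM-based soft scoring, here is a minimal sketch using the OpenAI Python client; the prompt is a simplified stand-in for the paper's few-shot evaluation template, and the score parsing and clamping are assumptions rather than the benchmark's exact implementation.

```python
# Minimal sketch of LLM-based soft scoring in the spirit of MM-Vet's evaluator.
# The prompt below is a simplified stand-in; the paper uses a few-shot template
# with worked examples (see the MM-Vet repo for the actual prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_answer(question: str, ground_truth: str, prediction: str) -> float:
    """Ask GPT-4 to grade an open-ended prediction with a soft score in [0, 1]."""
    prompt = (
        "Compare the model's answer to the ground truth and output only a "
        "correctness score between 0.0 and 1.0, where 1.0 is fully correct "
        "and partial credit is allowed.\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Model answer: {prediction}\n"
        "Score:"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    try:
        return min(max(float(text), 0.0), 1.0)  # clamp to the valid range
    except ValueError:
        return 0.0  # fall back if the evaluator returns non-numeric text
```

Reported totals are then averages of such per-sample scores over the full question set, expressed as percentages.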
Architecture
Figure 1: Contrast between existing benchmarks (isolated tasks) and MM-Vet (integrated capabilities).
Evaluation Highlights
  • GPT-4V achieves the highest total score of 67.8%, significantly outperforming open-source alternatives like LLaVA-13B (LLaMA-2) at 36.3%.
  • LLaVA-13B (LLaMA-2) outperforms LLaVA-13B (Vicuna-13B) by 8.3% on Recognition, suggesting that stronger underlying large language models (LLMs) improve visual recognition.
  • MM-ReAct (GPT-4 with external tools) achieves the best OCR score (65.7%) among the open-source and tool-using systems evaluated, surpassing end-to-end models such as LLaVA-13B (22.7%).
Breakthrough Assessment
8/10
Significant contribution to LMM evaluation by shifting focus from isolated tasks to integrated capabilities. The LLM-based soft scoring for open-ended QA is a practical modernization of VL benchmarking.