MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

📝 Paper Summary

Multimodal Benchmarking, Expert AGI Evaluation

MMMU is a massive benchmark designed to evaluate Large Multimodal Models (LMMs) on college-level tasks requiring expert subject knowledge and deliberate reasoning across 30 diverse image types.

Core Problem

Existing multimodal benchmarks focus on commonsense or elementary knowledge with limited image types (mostly photos), failing to test the expert-level reasoning and broad subject mastery required for Expert AGI.

Why it matters:

Current benchmarks like VQA are saturated by models that still cannot replace skilled human labor in specialized fields
Expert AGI requires proficiency at the level of skilled adults (college exams), which existing datasets do not measure
Critical domain-specific visual formats (medical scans, chemical structures, circuit diagrams) are largely absent from standard evaluations

Concrete Example: In a music theory question, a model is shown sheet music and asked 'Which harmonic interval is constructed incorrectly?' Options include 'Major third' and 'Diminished fifth'. To answer, the model must read musical notation and apply music theory rules, a skill far beyond identifying objects in a natural photo.

Key Novelty

Expert-Level Multimodal Evaluation Benchmark

Curates 11.5K questions from college exams and textbooks across 6 disciplines (Art, Business, Science, Medicine, Humanities, Engineering) and 30 subjects
Includes 30 highly heterogeneous image types beyond natural photos, such as chemical structures, sheet music, path diagrams, and medical imaging
Focuses on joint perception and reasoning where text and images are interleaved, requiring deep domain knowledge rather than simple pattern recognition

Evaluation Highlights

GPT-4V achieves only 55.7% accuracy on the test set, far below Expert Human performance (88.6%, measured on the validation set), highlighting the benchmark's difficulty
Open-source models trail further: LLaVA-1.5-13B achieves ~33.6% accuracy, roughly 22 points below GPT-4V
Models perform poorly on domain-specific imagery: GPT-4V scores high on photos but drops significantly for 'Chemical Structures' and 'Mechanical Diagrams'

Breakthrough Assessment

9/10

Sets a new, rigorously difficult standard for multimodal AGI, exposing the gap between current 'SOTA' and actual expert-level human capability.

⚙️ Technical Details

Problem Definition

Setting: Multi-discipline Multimodal Question Answering (Multiple Choice & Open-Ended)

Inputs: Natural language question Q interleaved with Image(s) I

Outputs: Answer A (Option selection or short text)
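The input/output setting above can be sketched as a small data structure plus a helper that splits a question into ordered text and image segments. This is a minimal illustration, not the benchmark's actual loader: the `<image N>` placeholder format, the `Example` class, and the `interleave` function are assumptions introduced here.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed placeholder format for images embedded in the question text.
IMAGE_TAG = re.compile(r"<image (\d+)>")


@dataclass
class Example:
    """Hypothetical container for one question (Q, I) and its answer A."""
    question: str                                      # Q, may contain image placeholders
    options: List[str] = field(default_factory=list)   # empty for open-ended questions
    answer: str = ""                                   # option letter or short text


def interleave(question: str) -> List[Tuple[str, str]]:
    """Split a question into ordered ("text", s) and ("image", index) segments."""
    segments = []
    pos = 0
    for match in IMAGE_TAG.finditer(question):
        text = question[pos:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("image", match.group(1)))
        pos = match.end()
    tail = question[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = Example(
    question="Look at <image 1> and compare it with <image 2>. Which is larger?",
    options=["The first", "The second"],
    answer="A",
)
print(interleave(example.question))
```

Keeping the segments in order preserves the position of each image relative to the surrounding text, which is what distinguishes interleaved input from a single image followed by a caption.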

Comparison to Prior Work

vs. ScienceQA: MMMU targets college-level/expert knowledge (depth) across 30 subjects vs. ScienceQA's elementary/high-school level
vs. MMLU: MMMU requires multimodal reasoning (images+text) vs. text-only reasoning in MMLU
vs. VQA v2: MMMU includes 30 expert image types (charts, medical, diagrams) vs. mostly natural scenes in VQA
vs. MathVista [not cited in paper]: MMMU covers 6 broad disciplines vs. MathVista's exclusive focus on mathematics
vs. GAIA [not cited in paper]: MMMU focuses on subject-specific expert knowledge vs. GAIA's focus on tool use and general reasoning

Limitations

Focus on college exams may not fully capture 'Expert AGI' as defined by complex, real-world task performance
Manual curation process may carry biases from the specific textbooks or online sources used
Error analysis reveals models still struggle with basic perception in complex diagrams (e.g., confusing left/right in a flowchart), confounding reasoning evaluation

Reproducibility

Code: https://mmmu-benchmark.github.io/

Dataset, evaluation scripts, and leaderboard are publicly available at https://mmmu-benchmark.github.io/. The benchmark contains 11.5K questions. Code for baselines like LLaVA is open-source; proprietary models (GPT-4V) are accessed via API.

📊 Experiments & Results

Evaluation Setup

Zero-shot Question Answering on 11.5K questions across 30 subjects and 6 disciplines
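A zero-shot multiple-choice setup of this kind can be sketched as follows: the prompt contains only the question and lettered options, with no solved examples, and a simple rule extracts the chosen letter from the model's free-form response. The instruction wording, the function names, and the extraction heuristic are illustrative assumptions, not the paper's evaluation code.

```python
import re
import string
from typing import List, Optional


def build_prompt(question: str, options: List[str]) -> str:
    """Format a zero-shot prompt: question, lettered options, one instruction."""
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"({letter}) {option}")
    # Assumed instruction text; no in-context examples are included (zero-shot).
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)


def extract_choice(response: str, num_options: int) -> Optional[str]:
    """Naive heuristic: prefer "(B)"-style answers, then a standalone letter."""
    valid = string.ascii_uppercase[:num_options]
    match = re.search(rf"\(([{valid}])\)", response)
    if match:
        return match.group(1)
    match = re.search(rf"\b([{valid}])\b", response)
    return match.group(1) if match else None


prompt = build_prompt(
    "Which harmonic interval is constructed incorrectly?",
    ["Major third", "Diminished fifth"],
)
print(prompt)
print(extract_choice("The answer is (B).", num_options=2))
```

The standalone-letter fallback is deliberately simple and can misfire, for example on a response that begins with the article "A", so a real harness would need a stricter parsing rule.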

Benchmarks:

MMMU (Multimodal Question Answering (Expert Level)) [New]

Metrics:

Micro-averaged Accuracy
Statistical methodology: Not explicitly reported in the paper
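Micro-averaged accuracy pools every question into a single count, so larger subjects carry more weight than they would under a per-subject (macro) average. The sketch below contrasts the two on made-up toy records; the data and function names are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import List, Tuple

Record = Tuple[str, bool]  # (subject, whether the prediction was correct)


def micro_accuracy(records: List[Record]) -> float:
    """Total correct predictions divided by total questions."""
    return sum(1 for _, ok in records if ok) / len(records)


def macro_accuracy(records: List[Record]) -> float:
    """Unweighted mean of per-subject accuracies, shown for contrast."""
    by_subject = defaultdict(list)
    for subject, ok in records:
        by_subject[subject].append(ok)
    per_subject = [sum(flags) / len(flags) for flags in by_subject.values()]
    return sum(per_subject) / len(per_subject)


# Toy data: a small subject answered perfectly, a large one answered poorly.
toy = (
    [("Art", True)] * 2
    + [("Science", True)] * 2
    + [("Science", False)] * 6
)
print(micro_accuracy(toy))  # 4 correct out of 10 questions
print(macro_accuracy(toy))  # mean of 2/2 and 2/8
```

On this toy data the micro average is 0.4 while the macro average is 0.625, which shows why the choice of averaging matters when subjects differ in size.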

Key Results

Leaderboard results demonstrate a significant gap between proprietary state-of-the-art models and open-source models, as well as a large gap between the best models and human experts.
Benchmark	Metric	Baseline	This Paper	Δ
MMMU Test	Accuracy	23.9	69.1	+45.2
MMMU Test	Accuracy	44.7	55.7	+11.0
MMMU Validation	Accuracy	56.8	88.6	+31.8
MMMU (Easy vs Hard)	Accuracy	31.2	76.1	+44.9
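The Δ column in the table is the "This Paper" value minus the "Baseline" value, in accuracy points. A short consistency check over the four rows, with the numbers copied from the table and a hypothetical helper name, might look like this:

```python
# (benchmark, baseline, this_paper, reported_delta), copied from the table rows.
ROWS = [
    ("MMMU Test", 23.9, 69.1, 45.2),
    ("MMMU Test", 44.7, 55.7, 11.0),
    ("MMMU Validation", 56.8, 88.6, 31.8),
    ("MMMU (Easy vs Hard)", 31.2, 76.1, 44.9),
]


def check_deltas(rows, ndigits=1):
    """Return rows whose reported delta differs from this_paper - baseline."""
    mismatches = []
    for name, baseline, this_paper, reported in rows:
        # Round to one decimal to absorb floating-point noise in the subtraction.
        computed = round(this_paper - baseline, ndigits)
        if computed != reported:
            mismatches.append((name, computed, reported))
    return mismatches


print(check_deltas(ROWS))
```

An empty list means every reported Δ matches the difference of its two columns.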

Experiment Figures

Performance of various models (GPT-4V, LLaVA, etc.) broken down by specific image types (Diagrams, Tables, Chemical Structures, etc.)

Distribution of error types for GPT-4V based on manual analysis of 150 incorrect samples

Main Takeaways

Massive performance gap: Even SOTA models (GPT-4V/4o) fall well short of Human Expert performance (88.6%), indicating the benchmark is far from solved
Domain disparity: Models perform well in Humanities/Art but struggle in Science, Medicine, and Engineering, which require complex visual reasoning
Image type sensitivity: Models perform worse on uncommon image types (circuits, molecules) than on photos, suggesting a lack of training data diversity
Reasoning bottleneck: Error analysis shows 26% of errors are due to flawed reasoning and 29% due to lack of domain knowledge, even when perception is correct

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Multimodal Models (LMMs)
Zero-shot evaluation methodologies
Concepts of AGI (Artificial General Intelligence) levels

Key Terms

LMM: Large Multimodal Model—a neural network capable of processing and reasoning over both text and images (e.g., GPT-4V, LLaVA)

Expert AGI: AI systems that reach the 90th percentile of skilled adults in a broad range of tasks (Level 3 in AGI taxonomy)

Interleaved Input: Input sequences where images and text are mixed together, requiring the model to maintain context across modalities

OCR: Optical Character Recognition—converting text within images into machine-readable text strings

Zero-shot: Evaluating a model on a task without providing any examples of that task during inference

VQA: Visual Question Answering—a task where a model answers a question about a given image

Hallucination: When a model generates plausible-sounding but factually incorrect information