
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
IN.AI Research, University of Waterloo, The Ohio State University, Carnegie Mellon University, Princeton University
Computer Vision and Pattern Recognition (2023)
MM Benchmark Reasoning QA

📝 Paper Summary

Multimodal Benchmarking · Expert AGI Evaluation
MMMU is a massive benchmark designed to evaluate Large Multimodal Models (LMMs) on college-level tasks requiring expert subject knowledge and deliberate reasoning across 30 diverse image types.
Core Problem
Existing multimodal benchmarks focus on commonsense or elementary knowledge with limited image types (mostly photos), failing to test the expert-level reasoning and broad subject mastery required for Expert AGI.
Why it matters:
  • Models have saturated existing benchmarks such as VQA, yet they still cannot replace skilled human labor in specialized fields
  • Expert AGI requires proficiency at the level of skilled adults (college exams), which existing datasets do not measure
  • Critical domain-specific visual formats (medical scans, chemical structures, circuit diagrams) are largely absent from standard evaluations
Concrete Example: In a music theory question, a model is shown sheet music and asked, 'Which harmonic interval is constructed incorrectly?' Options include 'Major third' and 'Diminished fifth'. To answer, the model must read musical notation and apply music theory rules, a skill far beyond identifying objects in a natural photo.
Key Novelty
Expert-Level Multimodal Evaluation Benchmark
  • Curates 11.5K questions from college exams and textbooks across 6 disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering) and 30 subjects
  • Includes 30 highly heterogeneous image types beyond natural photos, such as chemical structures, sheet music, path diagrams, and medical imaging
  • Focuses on joint perception and reasoning where text and images are interleaved, requiring deep domain knowledge rather than simple pattern recognition
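To make the idea of interleaved text and images concrete, here is a minimal sketch of how one such question could be represented and split into ordered segments. Everything in it is an assumption for illustration: the `Question` fields, the `<image N>` placeholder syntax, and the `split_interleaved` helper are hypothetical and are not taken from the benchmark's actual data format.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

# Hypothetical placeholder marking where an image appears inside the text.
IMAGE_SLOT = re.compile(r"<image (\d+)>")


@dataclass
class Question:
    """Illustrative record for one multiple-choice question (assumed schema)."""
    discipline: str
    subject: str
    text: str                      # question text with <image N> placeholders
    options: List[str]
    answer: str                    # letter of the correct option, e.g. "B"
    image_types: Dict[int, str] = field(default_factory=dict)


Segment = Tuple[str, Union[str, int]]


def split_interleaved(text: str) -> List[Segment]:
    """Split text into ordered ('text', str) and ('image', index) segments."""
    segments: List[Segment] = []
    pos = 0
    for match in IMAGE_SLOT.finditer(text):
        if match.start() > pos:
            segments.append(("text", text[pos:match.start()]))
        segments.append(("image", int(match.group(1))))
        pos = match.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments


example = Question(
    discipline="Art & Design",
    subject="Music",
    text="Given <image 1>, which harmonic interval is constructed incorrectly?",
    options=["Major third", "Diminished fifth"],
    answer="A",  # placeholder value, not the real answer key
    image_types={1: "sheet music"},
)
segments = split_interleaved(example.text)
```

Keeping the segments in order lets a model consume text and images in the sequence the question author intended, rather than receiving all images up front.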
Evaluation Highlights
  • GPT-4V achieves only 55.7% accuracy on the test set, well below expert human performance (88.6%, measured on the validation set), highlighting the benchmark's difficulty
  • Open-source models trail further: LLaVA-1.5-13B achieves ~33.6% accuracy, roughly 22 points below GPT-4V
  • Models perform poorly on domain-specific imagery: GPT-4V scores high on photos but drops significantly for 'Chemical Structures' and 'Mechanical Diagrams'
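The gaps implied by the figures above can be worked out directly. The snippet below is a small arithmetic sketch using only the accuracies quoted in this summary; the dictionary keys and the `gap` helper are illustrative names, and the human score comes from a different split (validation) than the GPT-4V score (test), so the human gap is indicative rather than exact.

```python
# Accuracies (in percent) as quoted in this summary.
reported = {
    "expert_human_val": 88.6,   # validation set
    "gpt4v_test": 55.7,         # test set
    "llava_1_5_13b": 33.6,      # approximate figure from the summary
}


def gap(higher: str, lower: str) -> float:
    """Difference in percentage points between two reported accuracies."""
    return round(reported[higher] - reported[lower], 1)


human_vs_gpt4v = gap("expert_human_val", "gpt4v_test")
gpt4v_vs_llava = gap("gpt4v_test", "llava_1_5_13b")

print(f"Expert human vs GPT-4V: {human_vs_gpt4v} points")
print(f"GPT-4V vs LLaVA-1.5-13B: {gpt4v_vs_llava} points")
```

On these numbers, the distance from GPT-4V to expert humans (about 33 points) is larger than the distance from the open-source model to GPT-4V (about 22 points).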
Breakthrough Assessment
9/10
Sets a new, rigorously difficult standard for multimodal AGI, exposing the gap between current 'SOTA' and actual expert-level human capability.