Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark

Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng
University of Electronic Science and Technology of China, Sun Yat-sen University, University of Washington, Microsoft, The Chinese University of Hong Kong
International Conference on Machine Learning (2025)
MM Benchmark Reasoning

📝 Paper Summary

Multimodal Reasoning Benchmark Visual Chain-of-Thought Evaluation
EMMA is a benchmark that filters out questions solvable from text or captions alone, and shows that current MLLMs perform poorly on tasks that require visual and textual reasoning to be genuinely integrated.
Core Problem
Existing multimodal benchmarks often contain 'fake' multimodal questions where the text fully describes the image, allowing models to shortcut reasoning without processing visual information.
Why it matters:
  • Current benchmarks overestimate MLLM capabilities by rewarding text-only reasoning rather than true cross-modal integration
  • Real-world tasks in physics, chemistry, and coding require 'organic' reasoning where visual aids are distinct from and complementary to textual descriptions, not redundant
  • Techniques like Chain-of-Thought (CoT) and test-time compute scaling are being optimized for text but may fail or hurt performance when applied to visual reasoning tasks
Concrete Example: In a physics problem asking for the direction of the net electric force (Figure 1), GPT-4o correctly states the text rule 'like charges repel' but selects the wrong vector direction. It possesses the textual knowledge but fails to ground this concept in the specific spatial layout of the image.
Key Novelty
Enhanced MultiModal ReAsoning (EMMA) Benchmark Construction
  • Implements a strict two-stage filtering pipeline: a question is removed if LLMs can solve it using (1) the original text alone, or (2) the original text plus GPT-4o-generated image captions
  • Retains only questions that require direct visual processing (e.g., spatial manipulation, pattern recognition) that captions cannot put into words
  • Constructs 1,796 novel questions in domains requiring spatial simulation, such as organic chemistry structure recognition and code-to-visualization tasks
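The two-stage filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Question` type, the `Solver` and `Captioner` callables, and exact-match answer checking are all assumptions made for the example. In practice the solvers would wrap LLM calls and the captioner would wrap a captioning model such as GPT-4o.

```python
from typing import Callable, List, NamedTuple, Sequence


class Question(NamedTuple):
    """Hypothetical record for one benchmark candidate."""
    text: str      # question text as shown to the model
    image_id: str  # key used to look up the image (or its caption)
    answer: str    # gold answer


# Hypothetical interfaces: a solver maps a text prompt to a predicted answer,
# and a captioner maps an image id to a textual description of the image.
Solver = Callable[[str], str]
Captioner = Callable[[str], str]


def needs_vision(q: Question, solvers: Sequence[Solver], captioner: Captioner) -> bool:
    """Return True if no text-only solver gets the question right."""
    # Stage 1: drop the question if any solver answers from the text alone.
    if any(solve(q.text) == q.answer for solve in solvers):
        return False
    # Stage 2: drop it if any solver answers from the text plus a caption.
    prompt = f"{q.text}\nImage caption: {captioner(q.image_id)}"
    if any(solve(prompt) == q.answer for solve in solvers):
        return False
    return True


def filter_questions(
    questions: Sequence[Question],
    solvers: Sequence[Solver],
    captioner: Captioner,
) -> List[Question]:
    """Keep only the questions that survive both filtering stages."""
    return [q for q in questions if needs_vision(q, solvers, captioner)]
```

Running stage 1 before stage 2 avoids captioning images for questions that the text already gives away. Exact string matching is the simplest possible correctness check; a real pipeline would likely need answer normalization and several sampling attempts per question.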
Evaluation Highlights
  • OpenAI o1 (best model) achieves only 45.75% accuracy on the balanced subset, trailing human experts (~77%) by roughly 32 percentage points
  • o1 outperforms the best non-reasoning MLLM (Qwen2-VL) by 8.5 percentage points, but most models struggle to exceed 40%
  • Textual Chain-of-Thought (CoT) prompting frequently hurts performance on visual-heavy tasks compared to direct answering, suggesting text reasoning can hallucinate or disconnect from visual reality
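Gaps between accuracy scores are easy to misread, so the short sketch below separates an absolute gap (percentage points) from a relative gap (percent of the higher score). The two accuracy values are the ones quoted above, with the human figure being approximate; the function names are made up for this illustration.

```python
def point_gap(higher: float, lower: float) -> float:
    """Absolute gap between two accuracies, in percentage points."""
    return higher - lower


def relative_gap(higher: float, lower: float) -> float:
    """Gap expressed as a percentage of the higher accuracy."""
    return (higher - lower) / higher * 100.0


HUMAN_ACC = 77.0  # approximate human-expert accuracy from the summary (%)
O1_ACC = 45.75    # o1 accuracy on the balanced subset (%)

print(f"{point_gap(HUMAN_ACC, O1_ACC):.2f} percentage points")
print(f"{relative_gap(HUMAN_ACC, O1_ACC):.1f}% relative to the human score")
```

With the rounded human figure the absolute gap comes out a little under 32 points, which is why the margin is best read as approximate.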
Breakthrough Assessment
8/10
Critically exposes the 'fake multimodality' in existing benchmarks through rigorous filtering. Demonstrates that despite hype, MLLMs fundamentally lack fine-grained visual reasoning capabilities.