Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark

📝 Paper Summary

Multimodal Reasoning Benchmark Visual Chain-of-Thought Evaluation

EMMA is a benchmark that filters out questions solvable from text or captions alone, revealing that current MLLMs perform poorly on tasks that require organic integration of visual and textual reasoning.

Core Problem

Existing multimodal benchmarks often contain 'fake' multimodal questions where the text fully describes the image, allowing models to shortcut reasoning without processing visual information.

Why it matters:

Current benchmarks overestimate MLLM capabilities by rewarding text-only reasoning rather than true cross-modal integration
Real-world tasks in physics, chemistry, and coding require 'organic' reasoning where visual aids are distinct from and complementary to textual descriptions, not redundant
Techniques like Chain-of-Thought (CoT) and test-time compute scaling are being optimized for text but may fail or hurt performance when applied to visual reasoning tasks

Concrete Example: In a physics problem asking for the direction of the net electric force (Figure 1), GPT-4o correctly states the text rule 'like charges repel' but selects the wrong vector direction. It possesses the textual knowledge but fails to ground this concept in the specific spatial layout of the image.

Key Novelty

Enhanced MultiModal ReAsoning (EMMA) Benchmark Construction

Implements a strict two-stage filtering pipeline that removes any question LLMs can solve from (1) the original text alone or (2) the original text plus GPT-4o-generated image captions
Ensures all remaining questions strictly require direct visual processing (e.g., spatial manipulation, pattern recognition) that cannot be verbalized into captions
Constructs 1,796 novel questions in domains requiring spatial simulation, such as organic chemistry structure recognition and code-to-visualization tasks
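The two-stage filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Question` fields, the `Solver` callable type, the prompt wording, and the rule "drop a question if any text-only solver gets it right" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    text: str     # question text, including answer options if any
    caption: str  # machine-generated image description (assumed precomputed)
    answer: str   # gold answer


# Hypothetical interface: a solver maps a text-only prompt to a predicted answer.
Solver = Callable[[str], str]


def solvable(prompt: str, gold: str, solvers: List[Solver]) -> bool:
    """Return True if any text-only solver reproduces the gold answer."""
    target = gold.strip().lower()
    return any(solver(prompt).strip().lower() == target for solver in solvers)


def filter_questions(questions: List[Question], solvers: List[Solver]) -> List[Question]:
    """Keep only questions that survive both filtering stages."""
    kept = []
    for q in questions:
        # Stage 1: drop questions answerable from the text alone.
        if solvable(q.text, q.answer, solvers):
            continue
        # Stage 2: drop questions answerable from text plus an image caption.
        captioned = f"{q.text}\nImage description: {q.caption}"
        if solvable(captioned, q.answer, solvers):
            continue
        kept.append(q)
    return kept
```

In practice each solver would wrap a call to an LLM; anything that survives both stages is, under these assumptions, a question whose answer depends on visual content the caption did not capture.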

Evaluation Highlights

OpenAI o1 (best model) achieves only 45.75% accuracy on the balanced subset, trailing human experts (77.75%) by 32 percentage points
o1 outperforms the best non-reasoning MLLM (Qwen2-VL, 37.25%) by 8.5 percentage points, but most models struggle to exceed 40%
Textual Chain-of-Thought (CoT) prompting frequently hurts performance on visual-heavy tasks compared to direct answering, suggesting text reasoning can hallucinate or disconnect from visual reality

Breakthrough Assessment

8/10

Critically exposes the 'fake multimodality' in existing benchmarks through rigorous filtering. Demonstrates that despite hype, MLLMs fundamentally lack fine-grained visual reasoning capabilities.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Question Answering where the answer A depends on reasoning over both Text T and Image I

Inputs: Image I and Text Question T (where T alone or T + Caption(I) is insufficient)

Outputs: Answer A (Multiple Choice or Open-Ended with short answer)

Comparison to Prior Work

vs. MMMU-Pro: EMMA applies stricter filtering by removing questions solvable via image captions, ensuring questions require direct visual processing rather than just visual perception
vs. MathVista: EMMA filters out ~50% of questions from such benchmarks that are actually solvable via text shortcuts [implied by filtering process]
vs. Visual CoT [not cited in paper]: EMMA focuses on reasoning tasks (simulation, transformation) rather than just perception tasks where cropping/highlighting (Visual CoT) applies

Limitations

Filtering pipeline relies on GPT-4o and Llama-3, whose own limitations might affect which questions are discarded or kept
Physics and Chemistry problems meeting the strict multimodal criteria were difficult to source, limiting dataset size in those domains compared to Math
Evaluation shows that current reward models are unreliable for multimodal reasoning, complicating the use of test-time scaling methods

Reproducibility

The paper describes the EMMA benchmark construction in detail, including the filtering pipeline and data sources (Math-Vision, MathVista, etc.). It reports 2,788 problems in total. However, the text provided does not contain a specific URL for the dataset or code release.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation of 9 SoTA MLLMs using both Direct Answering and Chain-of-Thought (CoT) prompting

Benchmarks:

EMMA: multimodal reasoning across Math, Physics, Chemistry, and Coding [New]

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper
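A minimal sketch of how the two prompting modes could be scored with accuracy. The prompt templates, the `Answer:` extraction convention, the item dictionary keys, and the `model(prompt, image)` interface are hypothetical; the paper's exact prompts and answer parsing are not given in this summary.

```python
from typing import Any, Callable, Dict, List

# Hypothetical prompt templates for the two evaluation modes.
DIRECT_TEMPLATE = "{question}\nAnswer with the option letter only."
COT_TEMPLATE = (
    "{question}\nThink step by step, then give the final answer as 'Answer: <letter>'."
)


def extract_answer(response: str) -> str:
    """Take the text after the last 'Answer:' marker if present, else the whole response."""
    marker = "Answer:"
    if marker in response:
        response = response.rsplit(marker, 1)[1]
    return response.strip().upper()


def accuracy(
    model: Callable[[str, Any], str],
    items: List[Dict[str, Any]],
    template: str,
) -> float:
    """Percentage of items whose extracted prediction matches the gold answer."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = 0
    for item in items:
        prompt = template.format(question=item["question"])
        prediction = extract_answer(model(prompt, item["image"]))
        correct += prediction == item["answer"].strip().upper()
    return 100.0 * correct / len(items)
```

Running `accuracy` once with `DIRECT_TEMPLATE` and once with `COT_TEMPLATE` on the same items gives the paired comparison behind the claim that CoT can hurt on visual-heavy tasks.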

Key Results

Performance on the balanced subset of EMMA shows a significant gap between the best reasoning model (o1), standard MLLMs, and human experts.
Benchmark	Metric	Baseline	Best Model (o1)	Δ
EMMA (Balanced Subset)	Accuracy (%)	37.25 (Qwen2-VL, best non-reasoning MLLM)	45.75	+8.50
EMMA (Balanced Subset)	Accuracy (%)	77.75 (human experts)	45.75	-32.00
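The Δ values are differences in percentage points, not relative changes. The short sketch below recomputes both from the three accuracies reported in this summary; the dictionary keys and function names are made up for the example.

```python
# Accuracies (%) on the balanced subset, as reported in this summary.
ACCURACY = {
    "o1": 45.75,
    "best_non_reasoning_mllm": 37.25,
    "human_experts": 77.75,
}


def point_gap(model: str, reference: str) -> float:
    """Absolute difference in percentage points."""
    return ACCURACY[model] - ACCURACY[reference]


def relative_gap(model: str, reference: str) -> float:
    """Difference expressed as a percentage of the reference score."""
    return 100.0 * point_gap(model, reference) / ACCURACY[reference]


for ref in ("best_non_reasoning_mllm", "human_experts"):
    print(
        f"o1 vs {ref}: {point_gap('o1', ref):+.2f} points, "
        f"{relative_gap('o1', ref):+.1f}% relative"
    )
```

The point gaps match the Δ column (+8.50 and -32.00); the relative figures are noticeably larger in magnitude, which is why the two units should not be mixed.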

Experiment Figures

Examples of EMMA tasks across domains: Math (Pattern Inference), Physics (Visual Decomposition), Chemistry (Reaction Simulation), and Coding (3D Visualization)

The EMMA data construction and filtering pipeline

Main Takeaways

Chain-of-Thought (CoT) prompting often fails or negatively impacts performance on visual-heavy tasks, as models hallucinate text reasoning that contradicts visual evidence
Test-time compute scaling (e.g., Best-of-N) provides minimal gains because models struggle to generate any valid visual reasoning path, unlike in text-only tasks where scaling is effective
Models perform particularly poorly on tasks requiring spatial simulation (e.g., 3D transformations, reaction mechanisms), highlighting a fundamental lack of 'visual physics' understanding
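For readers unfamiliar with the test-time scaling methods mentioned above, here is a minimal sketch of the two selection rules. The function names and the `reward` callable are hypothetical stand-ins; in practice the candidates would be sampled model responses and the reward would come from a learned reward model.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among N sampled responses."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[str], reward: Callable[[str], float]) -> str:
    """Return the candidate that the reward function scores highest."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=reward)
```

Both rules only select among sampled candidates. If no sample contains a valid visual reasoning path, or the reward function ranks candidates poorly, neither can recover the correct answer, which is consistent with the limited gains reported above.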

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multimodal Large Language Models (MLLMs)
Familiarity with Chain-of-Thought (CoT) prompting
Knowledge of test-time compute scaling (e.g., Best-of-N, Majority Voting)

Key Terms

MLLM: Multimodal Large Language Model—an AI model capable of processing and reasoning over both text and image inputs

CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer

Test-time compute scaling: Techniques to improve model performance during inference (not training) by spending more computational resources, such as generating multiple answers and voting (Majority Voting)

Visual reasoning: The ability to manipulate, analyze, and infer conclusions from visual inputs (e.g., spatial rotation, path tracing), distinct from merely recognizing objects

Organic multimodal reasoning: Reasoning that requires integrating complementary information from both text and vision, where neither modality is sufficient on its own

EMMA: Enhanced MultiModal ReAsoning—the benchmark introduced in this paper

SoTA: State-of-the-Art—the current best performing models or methods