MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

📝 Paper Summary

Multimodal Math Reasoning, Large Multimodal Models (LMMs) Evaluation, Process Evaluation

MM-MATH is a multimodal math benchmark comprising 5,929 problems that evaluates LMMs not just on final answers but also on reasoning steps, revealing that diagram misinterpretation is the primary cause of failure.

Core Problem

Existing multimodal math benchmarks rely on binary answer comparisons, failing to identify why models fail (e.g., visual misunderstanding vs. reasoning errors) or where their specific weaknesses lie.

Why it matters:

Current LMMs like GPT-4V still underperform on math tasks requiring interleaved text and image reasoning, but the specific failure modes remain opaque
Outcome-only evaluation hides whether a correct answer was derived from correct reasoning or lucky guessing
Lack of fine-grained metadata (difficulty, grade, knowledge point) in prior benchmarks hinders targeted improvement of model capabilities

Concrete Example: In a geometry problem where 'D is the midpoint of AB', a model might incorrectly reason 'AD=AB' instead of 'AD=DB'. A standard benchmark simply marks the final answer wrong; MM-MATH identifies this specific intermediate step as a 'Reasoning error' using LMM-as-a-judge.

Key Novelty

Process Evaluation via LMM-as-a-Judge

Combines traditional outcome evaluation with a process evaluation pipeline that compares model-generated steps against a ground-truth solution
Uses GPT-4V to identify the *first* error in a solution trace and categorize it into four specific types (e.g., Diagram misinterpretation, Calculation error)
Includes extensive metadata (difficulty, grade level, knowledge points) for fine-grained performance analysis rather than a single aggregate score
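The fine-grained analysis described above amounts to grouping per-problem results by a metadata field. A minimal sketch, assuming each evaluated problem is stored as a dictionary; the field names (`difficulty`, `grade`, `correct`) and the toy values are illustrative and are not taken from the dataset:

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group value: accuracy in percent} for one metadata key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        correct[group] += int(rec["correct"])
    return {g: 100.0 * correct[g] / totals[g] for g in totals}


# Toy records (hypothetical): one entry per evaluated problem.
records = [
    {"difficulty": "easy", "grade": 7, "correct": True},
    {"difficulty": "easy", "grade": 8, "correct": False},
    {"difficulty": "hard", "grade": 9, "correct": False},
    {"difficulty": "hard", "grade": 9, "correct": False},
]

by_difficulty = accuracy_by(records, "difficulty")
by_grade = accuracy_by(records, "grade")
```

The same function can be reused for any metadata field, such as knowledge point, which is what makes per-category reporting cheap once the metadata exists.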

Architecture

Overview of the MM-MATH benchmark construction and evaluation pipeline.

Evaluation Highlights

GPT-4o achieves only 31.8% accuracy on MM-MATH, significantly lagging behind the human student average of 80.4%
Diagram misinterpretation accounts for >50% of errors across major LMMs (GPT-4o, GPT-4V, Claude-3-Opus), highlighting a critical vision-reasoning gap
Multimodal inputs (text+image) improve accuracy by only 2-4 percentage points over text-only inputs for top models, suggesting LMMs struggle to effectively utilize visual contexts

Breakthrough Assessment

8/10

While the dataset scale is moderate, the shift from binary outcome to process-oriented error analysis is a significant methodological improvement for understanding LMM reasoning failures.

⚙️ Technical Details

Problem Definition

Setting: Open-ended multimodal mathematical reasoning

Inputs: Textual problem statement P and associated image I

Outputs: Step-by-step solution S ending with a boxed final answer

Pipeline Flow

Data Collection & Filtering
Format Transformation (MathML to LaTeX)
Translation (Chinese to English)
Model Inference (Generate Solution)
Dual Evaluation (Outcome + Process)
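The last two stages of this flow (inference, then dual evaluation) can be sketched as a single function that takes the solver and both evaluators as callables. Everything here is hypothetical scaffolding: the `Problem` and `Result` types, the `evaluate` function, and the choice to call the process evaluator only when the final answer is wrong are assumptions of this sketch, not details confirmed by the summary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Problem:
    text: str                # problem statement P
    image_path: str          # associated image I
    reference_solution: str  # ground-truth step-by-step solution
    reference_answer: str    # ground-truth final answer


@dataclass
class Result:
    solution: str
    correct: bool
    first_error: Optional[str]  # set only when the answer is wrong


def evaluate(
    problem: Problem,
    solver: Callable[[str, str], str],
    outcome_eval: Callable[[str, str], bool],
    process_eval: Callable[[Problem, str], str],
) -> Result:
    """Run inference, then outcome evaluation, then (if needed) process evaluation."""
    solution = solver(problem.text, problem.image_path)
    correct = outcome_eval(solution, problem.reference_answer)
    # Assumption: the judge is only consulted for incorrect answers.
    first_error = None if correct else process_eval(problem, solution)
    return Result(solution, correct, first_error)


# Stand-in callables so the sketch runs without any model access.
demo = Problem(
    text="D is the midpoint of AB and AB = 6. Find AD.",
    image_path="fig.png",
    reference_solution="AD = DB = AB/2 = 3",
    reference_answer="3",
)
result = evaluate(
    demo,
    solver=lambda text, image: r"AD = AB = 6, so \boxed{6}",
    outcome_eval=lambda sol, ref: sol.endswith("\\boxed{" + ref + "}"),
    process_eval=lambda prob, sol: "Reasoning error",
)
```

Passing the stages in as callables keeps the flow testable with stubs and lets a real LMM client be swapped in later without changing the control logic.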

System Modules

Data Processor

Converts raw MathML data to standard LaTeX and translates Chinese content to English

Model or implementation: MathConverter + GPT-4

Solver

Generates step-by-step solutions for math problems

Model or implementation: Various LMMs (e.g., GPT-4o, Gemini-Pro-V, InternVL)

Outcome Evaluator (Evaluation)

Extracts and compares final answers

Model or implementation: Rule-based scripts + SymPy
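A minimal sketch of how a rule-based outcome evaluator of this kind might work, assuming the final answer is wrapped in `\boxed{...}` and is a plain expression (not LaTeX markup) that SymPy can parse. The function names `extract_boxed` and `answers_match` are hypothetical and not taken from the MM-MATH repository, and the fallback to exact fractions when SymPy is not installed is an assumption made so the example runs anywhere.

```python
from fractions import Fraction

try:  # SymPy is optional in this sketch
    import sympy
except ImportError:
    sympy = None


def extract_boxed(solution):
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close of \boxed{
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def answers_match(predicted, reference):
    """Compare two answers: exact string match first, then symbolic equality."""
    a, b = predicted.strip(), reference.strip()
    if a == b:
        return True
    if sympy is not None:
        try:
            # Equal expressions have a difference that simplifies to zero.
            diff = sympy.sympify(a) - sympy.sympify(b)
            return bool(sympy.simplify(diff) == 0)
        except Exception:
            return False  # unparseable input counts as a mismatch
    try:
        return Fraction(a) == Fraction(b)  # numeric-only fallback
    except (ValueError, ZeroDivisionError):
        return False


boxed = extract_boxed(r"Therefore AD = \boxed{\frac{1}{2}}.")
```

Two caveats: `sympy.sympify` evaluates its string argument, so it should not be fed untrusted text, and answers written in LaTeX would need to be converted to a parseable expression before the symbolic comparison applies.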

Process Evaluator (Evaluation)

Identifies the first error in the reasoning chain

Model or implementation: GPT-4V (LMM-as-a-judge)

Novel Architectural Elements

Process Evaluation Pipeline: Automated taxonomy of reasoning failures using LMM-as-a-judge to classify errors into 4 distinct categories (Diagram, Reasoning, Calculation, Textual)
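The four-way taxonomy and the "first error only" rule can be made concrete with a small prompt builder and verdict parser. This is a sketch under stated assumptions: the prompt wording, the `STEP:` / `TYPE:` reply format, the short category labels, and all names (`ErrorType`, `Verdict`, `build_judge_prompt`, `parse_verdict`) are hypothetical and are not the prompt or code used in the paper. The call to the judge model itself is left out.

```python
import re
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    # Short labels for the four categories named in the summary.
    DIAGRAM = "diagram"
    REASONING = "reasoning"
    CALCULATION = "calculation"
    TEXTUAL = "textual"


@dataclass
class Verdict:
    step: int              # index of the first incorrect step
    error_type: ErrorType  # category assigned to that step


def build_judge_prompt(problem, reference, candidate):
    """Assemble a judge prompt asking for the first error and its category."""
    labels = ", ".join(t.value for t in ErrorType)
    return (
        "Compare the candidate solution with the reference solution.\n"
        "Report only the FIRST incorrect step.\n"
        f"Answer as 'STEP: <number>' and 'TYPE: <one of {labels}>'.\n\n"
        f"Problem:\n{problem}\n\nReference:\n{reference}\n\nCandidate:\n{candidate}\n"
    )


def parse_verdict(reply):
    """Parse the judge's reply; return None if it does not follow the format."""
    step = re.search(r"STEP:\s*(\d+)", reply, re.IGNORECASE)
    kind = re.search(r"TYPE:\s*([a-z]+)", reply, re.IGNORECASE)
    if not step or not kind:
        return None
    try:
        return Verdict(int(step.group(1)), ErrorType(kind.group(1).lower()))
    except ValueError:  # label outside the four categories
        return None


verdict = parse_verdict("STEP: 2\nTYPE: Reasoning")
```

Returning `None` for malformed or out-of-taxonomy replies makes judge failures visible instead of silently assigning them to a category.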

Modeling

Base Model: Evaluates multiple models: GPT-4o, GPT-4V, Gemini-Pro-V, Claude-3-Opus, Qwen-VL-Max, InternVL-4B-Chat-V1.5, etc.

Training Method: N/A (Evaluation Paper)

Adaptation: None (Inference only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MATH-V/MathVista: MM-MATH adds process evaluation (step-by-step error analysis) instead of just final answer accuracy
vs. GeoQA: MM-MATH covers broader middle-school math topics beyond just geometry
vs. GSM8K [not cited in paper]: MM-MATH is multimodal, whereas GSM8K is text-only math reasoning
vs. MMMU [not cited in paper]: MM-MATH focuses specifically on K-12 math curriculum with fine-grained knowledge points, whereas MMMU covers broad collegiate disciplines

Limitations

Process evaluation relies on GPT-4V, which may hallucinate or miss errors (the judge's error rate is approximately 9%)
Dataset focuses on middle school level; may not challenge models on collegiate/abstract math
Although the translations were manually verified, the original data was in Chinese, which may introduce cultural-context biases

Reproducibility

Code: https://github.com/kge-sun/MM-Math (publicly available)

The dataset contains 5,929 problems, and the repository provides evaluation scripts. Process evaluation relies on GPT-4V, a closed-source dependency.

📊 Experiments & Results

Evaluation Setup

Zero-shot generative reasoning on middle school math problems

Benchmarks:

MM-MATH (Multimodal Chain-of-Thought Reasoning) [New]

Metrics:

Accuracy (Outcome Evaluation)
Error Type Distribution (Process Evaluation)
Statistical methodology: Not explicitly reported in the paper
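Both metrics reduce to simple counting. A minimal sketch, assuming outcome evaluation yields one boolean per problem and process evaluation yields one error label per incorrectly solved problem; the function names and toy inputs are illustrative:

```python
from collections import Counter


def accuracy(flags):
    """Percentage of problems whose final answer was judged correct."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def error_distribution(first_errors):
    """Share (in percent) of each first-error category among failed problems."""
    counts = Counter(first_errors)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy inputs (hypothetical), not results from the paper.
acc = accuracy([True, False, False, False])
dist = error_distribution(["diagram", "diagram", "reasoning", "calculation"])
```

Note that the two metrics have different denominators: accuracy is computed over all problems, while the error distribution is computed only over the solutions that received an error label.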

Key Results

Outcome evaluation shows a massive gap between SOTA LMMs and human students, and a surprisingly small gap between text-only and multimodal inputs.

Benchmark	Metric	Baseline	This Paper	Δ
MM-MATH	Accuracy (%)	80.4	31.8	-48.6
MM-MATH	Accuracy (%)	11.6	31.8	+20.2
MM-MATH	Accuracy (%)	27.6	31.8	+4.2
MM-MATH	Accuracy (%)	23.2	25.9	+2.7

Performance degrades significantly as problem difficulty increases.

Benchmark	Metric	Baseline	This Paper	Δ
MM-MATH (Hard Subset)	Accuracy (%)	45.8	10.9	-34.9

Experiment Figures

Pie charts showing the distribution of error types for different LMMs (GPT-4o, GPT-4V, etc.).

Main Takeaways

Significant Human-AI Gap: Even the best model (GPT-4o) is nearly 50 percentage points behind middle school students.
Visual Blindness: Models improve very little (approximately 2-4 percentage points) when given images in addition to text, indicating they solve problems primarily via text shortcuts.
Diagram Misinterpretation is Key: Over 50% of errors in top models (GPT-4o, GPT-4V) are due to misinterpreting diagrams, identifying this as the primary bottleneck.
Small Models are Competitive: The 4B-parameter InternVL model outperforms larger models like Gemini-Pro-V in visual recognition, suggesting model size is not the only factor for visual math competence.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Multimodal Models (LMMs)
Understanding of Chain-of-Thought (CoT) prompting
Basic knowledge of evaluation methodologies (Ground truth comparison)

Key Terms

LMM-as-a-judge: Using a strong LMM (like GPT-4V) to evaluate the quality or correctness of outputs from other models

Process Evaluation: Assessing the intermediate reasoning steps of a model's solution rather than just the final answer to identify specific error types

Outcome Evaluation: Traditional evaluation method comparing the final predicted answer to the ground truth (binary correct/incorrect)

Diagram misinterpretation: A specific error type where the model fails to correctly perceive geometric shapes, spatial relationships, or values presented in the image

MathML: A mathematical markup language for describing mathematical notation and capturing both its structure and content

SymPy: A Python library for symbolic mathematics, used here to simplify and compare algebraic expressions