MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

📝 Paper Summary

Visual Math Problem Solving, Multi-modal LLM Evaluation, Reasoning Benchmarks

MathVerse shows that multi-modal LLMs often rely on textual shortcuts in math problems. It tests them on six versions of each problem, with progressively less text and greater dependence on the diagram.

Core Problem

Existing visual math benchmarks contain questions with redundant text that explicitly describes the diagram (e.g., 'Triangle ABC'), allowing models to solve problems without actually interpreting the visual input.

Why it matters:

Current high scores on math benchmarks may be illusory, reflecting text-processing skills rather than genuine multi-modal reasoning
Binary evaluation (Correct/Incorrect) fails to capture whether a model used the correct visual reasoning process or just guessed based on text patterns

Concrete Example: In a geometry problem where the text states 'Angle A is 45 degrees' and the diagram also shows '45°', an MLLM can solve it purely from text. MathVerse creates a version where the text only asks 'Find the angle' (forcing the model to read '45°' from the diagram), causing many models to fail.

Key Novelty

MathVerse Benchmark & CoT Evaluation Strategy

Transforms each math problem into six versions (e.g., Text-dominant vs. Vision-only) by systematically removing textual cues (Descriptive Information, Implicit Properties) to isolate visual understanding
Proposes a Chain-of-Thought (CoT) evaluation where GPT-4 extracts reasoning steps and GPT-4V scores them, assessing the logic process rather than just the final answer

Architecture

The data transformation process creating six versions of a single math problem to test different levels of visual dependency.

Evaluation Highlights

Qwen-VL-Max is +5.1% more accurate on the Text-only version than on the version that includes the diagram, which suggests it treats diagrams as distractions rather than information sources
InternLM-XComposer2 scores +5.6% higher without visual input, which is consistent with reliance on textual shortcuts over visual reasoning
GPT-4V demonstrates the best visual comprehension but still suffers performance drops when redundant textual descriptions are removed

Breakthrough Assessment

9/10

Critically exposes a fundamental flaw in how MLLMs are evaluated (blindness to diagrams due to redundant text) and provides a rigorous methodology (six-version transformation) to address it.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal Mathematical Reasoning Evaluation

Inputs: Visual diagram D and textual question Q (in 6 variations of information density)

Outputs: Answer A and Chain-of-Thought reasoning steps

Pipeline Flow

Problem Transformation (Human Annotators)
Model Inference (Target MLLM)
Step Extraction (GPT-4)
Step Scoring (GPT-4V)

System Modules

Problem Transformer

Converts raw math problems into 6 versions (Text-dominant to Vision-only) by filtering DI, IP, and EC textual components

Model or implementation: Human Annotators
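
The filtering idea can be illustrated with a minimal sketch. In MathVerse this step is done by human annotators, not by code, so everything below is a hypothetical illustration: the names ProblemText, VersionSpec, VERSIONS and build_text are invented here, and the per-version component sets (as well as the version names other than Text-dominant, Text-only and Vision-only) are assumptions that this summary does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemText:
    """Text of one base problem, split into the three annotated components."""
    descriptive_information: str  # DI: restates what the diagram shows
    implicit_property: str        # IP: e.g. parallel lines, right angles
    essential_condition: str      # EC: numeric or algebraic values
    question: str                 # the actual ask, e.g. "Find angle C."


@dataclass(frozen=True)
class VersionSpec:
    keep: frozenset          # which of "DI", "IP", "EC" stay in the text
    show_diagram: bool       # whether the diagram is given to the model
    question_in_text: bool   # False when the question appears only in the image


# ASSUMPTION: this mapping is illustrative and is not taken from the summary.
VERSIONS = {
    "text_dominant":    VersionSpec(frozenset({"DI", "IP", "EC"}), True, True),
    "text_lite":        VersionSpec(frozenset({"IP", "EC"}), True, True),
    "text_only":        VersionSpec(frozenset({"DI", "IP", "EC"}), False, True),
    "vision_intensive": VersionSpec(frozenset({"EC"}), True, True),
    "vision_dominant":  VersionSpec(frozenset({"IP"}), True, True),
    "vision_only":      VersionSpec(frozenset(), True, False),
}


def build_text(problem: ProblemText, version: str) -> str:
    """Return the textual input for one version by dropping filtered components."""
    spec = VERSIONS[version]
    components = [
        ("DI", problem.descriptive_information),
        ("IP", problem.implicit_property),
        ("EC", problem.essential_condition),
    ]
    parts = [text for tag, text in components if tag in spec.keep and text]
    if spec.question_in_text:
        parts.append(problem.question)
    return " ".join(parts)


# Toy problem, made up for illustration only.
problem = ProblemText(
    "In triangle ABC,",
    "AB is perpendicular to BC.",
    "Angle A is 45 degrees.",
    "Find angle C.",
)
for name in VERSIONS:
    print(name, "->", repr(build_text(problem, name)))
```

Keeping the three components as separate fields makes each version a pure function of the base problem, so the six inputs cannot drift apart in wording.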

Target MLLM

Solves the math problem version provided

Model or implementation: Various (e.g., GPT-4V, Gemini, LLaVA)

Step Extractor (Evaluation)

Extracts key reasoning steps from the model's output, ignoring the original question to avoid bias

Model or implementation: GPT-4 (Text-only)

Step Scorer (Evaluation)

Verifies each extracted step against the diagram and ground truth

Model or implementation: GPT-4V (Multi-modal)

Novel Architectural Elements

Two-stage CoT evaluation pipeline: separating 'Step Extraction' (text-only) from 'Step Scoring' (visual-aware) to mitigate evaluator bias
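
The two-stage split can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the extractor and scorer are passed in as plain callables (in the paper these roles are filled by GPT-4 and GPT-4V), the toy stand-ins below only mimic their interfaces, and averaging per-step verdicts into a single score is an assumption, since the summary does not give the aggregation formula.

```python
from typing import Callable, List

# Stage 1 sees only the model's output text (no question, no diagram).
StepExtractor = Callable[[str], List[str]]
# Stage 2 sees one step plus the diagram bytes and the ground-truth answer.
StepScorer = Callable[[str, bytes, str], bool]


def cot_score(
    model_output: str,
    diagram: bytes,
    ground_truth: str,
    extract_steps: StepExtractor,
    score_step: StepScorer,
) -> float:
    """Fraction of extracted reasoning steps judged correct (assumed aggregation)."""
    steps = extract_steps(model_output)          # stage 1: text-only extraction
    if not steps:
        return 0.0                               # nothing to score
    verdicts = [score_step(step, diagram, ground_truth) for step in steps]  # stage 2
    return sum(verdicts) / len(verdicts)


# Hypothetical stand-ins so the sketch runs without any API access.
def toy_extract(output: str) -> List[str]:
    """Treat every non-empty line as one reasoning step."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def toy_score(step: str, diagram: bytes, ground_truth: str) -> bool:
    """Mark a step correct if it mentions the ground-truth answer (toy rule)."""
    return ground_truth in step


score = cot_score("x = 3\ny = 4\nso the answer is 3", b"", "3", toy_extract, toy_score)
print(f"CoT score: {score:.2f}")
```

Because the extractor never receives the question or the diagram, it cannot repair the model's reasoning while summarizing it. Only the scorer is given the visual context.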

Modeling

Base Model: GPT-4V (Evaluator)

Reproducibility

Code: https://mathverse-cuhk.github.io

Benchmark dataset and project page are available at https://mathverse-cuhk.github.io. The dataset contains 2,612 base problems expanded to ~15K samples. The evaluation relies on GPT-4 and GPT-4V APIs.

📊 Experiments & Results

Evaluation Setup

Visual Math Problem Solving across 3 subjects (Plane Geometry, Solid Geometry, Functions)

Benchmarks:

MathVerse (Multi-modal Math Reasoning) [New]

Metrics:

Accuracy
CoT Score (Reasoning Quality)
Statistical methodology: Not explicitly reported in the paper

Key Results

Results for models that score higher when the diagram is removed (Text-only) than when it is present. Δ is Text-only accuracy minus accuracy with the diagram.
Benchmark	Model	Metric	Baseline	This Paper	Δ
MathVerse	Qwen-VL-Max	Accuracy	Not reported in the paper	Not reported in the paper	+5.1%
MathVerse	InternLM-XComposer2	Accuracy	Not reported in the paper	Not reported in the paper	+5.6%

Experiment Figures

Comparison of redundant text in existing benchmarks vs. MathVerse's approach, and a bar chart showing accuracy drops when redundant text is removed.

The Chain-of-Thought (CoT) Evaluation Strategy pipeline.

Main Takeaways

Most MLLMs struggle to interpret math diagrams and rely heavily on textual redundancy; performance drops significantly when text descriptions (Descriptive Information) are removed.
Some models (Qwen-VL-Max, InternLM-XComposer2) achieve higher accuracy on Text-only versions than on multi-modal versions, indicating that diagrams act as 'noise' rather than helpful context for these models.
GPT-4V and ShareGPT4V show relatively better visual comprehension compared to other models, but still exhibit gaps when implicit visual properties must be inferred.
The CoT evaluation strategy reveals that even when models get the final answer correct, their intermediate visual reasoning steps are often flawed.
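
The Text-only versus multi-modal gap mentioned in the takeaways above is a difference of per-version accuracies. The sketch below shows that arithmetic on made-up records. The function names and the choice of "text_dominant" as the with-diagram reference version are assumptions, and the toy data is synthetic and does not reproduce any number from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_version(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per problem version from (version, is_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for version, is_correct in records:
        totals[version] += 1
        correct[version] += int(is_correct)
    return {version: correct[version] / totals[version] for version in totals}


def text_only_gap(acc: Dict[str, float], with_diagram: str = "text_dominant") -> float:
    """Text-only accuracy minus with-diagram accuracy, in percentage points.

    A positive value means the model did better once the diagram was removed.
    """
    return 100.0 * (acc["text_only"] - acc[with_diagram])


# Synthetic records for illustration only.
toy_records = [
    ("text_only", True), ("text_only", True), ("text_only", False), ("text_only", True),
    ("text_dominant", True), ("text_dominant", False),
    ("text_dominant", True), ("text_dominant", False),
]
acc = accuracy_by_version(toy_records)
print(f"gap: {text_only_gap(acc):+.1f} points")
```

A positive gap is the pattern reported for Qwen-VL-Max and InternLM-XComposer2. A negative gap would mean the diagram helps.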

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multi-modal Large Language Models (MLLMs)
Basics of Chain-of-Thought (CoT) prompting
Visual Question Answering (VQA) task structure

Key Terms

MLLM: Multi-modal Large Language Model, an AI model capable of processing and reasoning across both text and image inputs

CoT: Chain-of-Thought—a reasoning technique where the model generates intermediate steps before the final answer

Descriptive Information (DI): Textual content that explicitly describes observable elements in the diagram (e.g., 'There is a circle')

Implicit Property (IP): Geometric or spatial properties that require visual perception to identify (e.g., 'Lines AB and CD are parallel')

Essential Condition (EC): Specific numerical or algebraic measurements crucial for solving the problem (e.g., 'Length = 5')

GPT-4V: GPT-4 with Vision—a version of GPT-4 capable of analyzing images

Text-dominant: A problem version containing full redundant text, minimizing the need to look at the diagram

Vision-only: A problem version where the text is minimized, forcing the model to extract almost all information from the diagram