MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

📝 Paper Summary

Multimodal Benchmarking, Visual Mathematical Reasoning

MathVista is a comprehensive benchmark combining diverse visual contexts and mathematical reasoning tasks to quantify the performance gap between state-of-the-art foundation models and human capability.

Core Problem

Current benchmarks focus either on text-only math or simple visual scenes, failing to evaluate the fine-grained visual understanding and compositional reasoning required for mathematically intensive real-world tasks.

Why it matters:

AI agents need strong visual math reasoning for applications in education, data analysis, and scientific discovery
Existing VQA (Visual Question Answering) datasets on natural scenes lack the depth of mathematical reasoning found in charts, function plots, and geometry diagrams
The capabilities of Large Multimodal Models (LMMs) in rigorous visual-mathematical contexts remain largely unexplored and have not been measured systematically

Concrete Example: When asked whether a plotted function is injective, Multimodal Bard correctly identifies the function type (parabola) but then applies the Law of Cosines, a geometry result, instead of reasoning from function properties, which produces a hallucinated explanation.

Key Novelty

Unified Visual-Math Benchmark (MathVista)

Consolidates 28 existing multimodal datasets and introduces 3 new ones (IQTest, FunctionQA, PaperQA) to cover gaps in logical, algebraic, and scientific reasoning
Defines a taxonomy of 7 mathematical reasoning types (e.g., algebraic, statistical) and 5 primary tasks (e.g., geometry problem solving, textbook QA) for fine-grained evaluation
Implements a robust evaluation pipeline using GPT-4 as an answer extractor to standardize outputs from diverse foundation models

Architecture

The composition and distribution of the MathVista dataset across different source datasets

Evaluation Highlights

GPT-4V achieves 49.9% overall accuracy, establishing a new state-of-the-art but trailing human performance (60.3%) by 10.4 percentage points
GPT-4V outperforms Multimodal Bard (the second-best model) by 15.1 percentage points (49.9% vs 34.8%)
Text-only GPT-4 augmented with captions and OCR achieves 33.9% with Program-of-Thought prompting, performing comparably to Multimodal Bard

Breakthrough Assessment

9/10

Sets a definitive standard for evaluating multimodal math reasoning. The gap revealed between GPT-4V and other models, and the remaining gap to humans, will drive future LMM research.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering requiring Mathematical Reasoning

Inputs: An image I (visual context) and a question Q (text)

Outputs: An answer A (either a multiple-choice option or a free-form numerical/textual value)
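As a rough illustration of this setting, one problem instance could be represented as below. The class name, field names, and example values are assumptions made for this sketch and are not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Problem:
    """One visual math problem: image I, question Q, gold answer A (hypothetical schema)."""
    image_path: str                          # visual context I
    question: str                            # question text Q
    answer: str                              # gold answer A, stored as text
    choices: Optional[Sequence[str]] = None  # set only for multiple-choice items

    @property
    def is_multiple_choice(self) -> bool:
        return self.choices is not None


# Two made-up instances: one multiple-choice, one free-form.
mc = Problem("plot.png", "Is the function injective?", "No", choices=("Yes", "No"))
free = Problem("chart.png", "What is the highest bar value?", "42")
```

Keeping the answer as text lets one type cover both multiple-choice options and free-form numerical or textual values.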

Pipeline Flow

Visual Preprocessing (Captioning/OCR) [for Aug-LLMs only]
Prompt Construction (Task + Visual Context + Question)
Model Inference (Response Generation)
Answer Extraction (GPT-4 Extractor)
Score Calculation
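The five stages above can be sketched for the text-only (augmented LLM) path as follows. This is a minimal sketch, not the authors' code: `run_pipeline`, its parameters, the prompt wording, and the exact-match scoring rule are all assumptions, and the model and extractor are passed in as plain callables so that no real API is implied.

```python
from typing import Callable, Optional


def run_pipeline(
    question: str,
    gold: str,
    model: Callable[[str], str],
    extractor: Callable[[str], str],
    visual_text: Optional[str] = None,
) -> bool:
    """Run one problem through the stages; return True if the answer matches."""
    # 1. Visual preprocessing happens upstream and only for augmented LLMs;
    #    its output arrives here as `visual_text`.
    # 2. Prompt construction: task instruction + visual context + question.
    parts = ["Solve the math problem."]
    if visual_text:
        parts.append(f"Image description: {visual_text}")
    parts.append(f"Question: {question}")
    prompt = "\n".join(parts)
    # 3. Model inference.
    response = model(prompt)
    # 4. Answer extraction.
    predicted = extractor(response)
    # 5. Score calculation for a single item (assumed: exact match after trimming).
    return predicted.strip() == gold.strip()


# Stand-ins for a real model and a real extractor (both hypothetical).
fake_model = lambda prompt: "The bars sum to 3 + 4, so the answer is 7."
last_token = lambda text: text.rstrip(".").split()[-1]
correct = run_pipeline("What is the total?", "7", fake_model, last_token,
                       visual_text="bars of height 3 and 4")
```

A multimodal model would receive the image directly instead of `visual_text`; that path is omitted here to keep the sketch self-contained.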

System Modules

Visual Preprocessing

Convert visual information into text for LLMs (Augmented-LLM setting only)

Model or implementation: Multimodal Bard (for captions) + EasyOCR (for text)
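To show what converting visual information into text can look like, the sketch below flattens a caption and a list of OCR detections into a single text block. The function name, the confidence threshold, and the output wording are assumptions for illustration; the OCR items are assumed to be (bounding box, text, confidence) tuples, and no OCR library is actually called.

```python
from typing import List, Sequence, Tuple

# Assumed item shape: (bounding box corners, recognized text, confidence).
OcrItem = Tuple[Sequence[Sequence[float]], str, float]


def visual_context_as_text(caption: str, ocr_items: List[OcrItem],
                           min_conf: float = 0.5) -> str:
    """Flatten a caption and OCR detections into text for a text-only LLM."""
    # Bounding boxes are dropped here, so spatial layout is lost.
    kept = [text for _box, text, conf in ocr_items if conf >= min_conf]
    lines = [f"Image caption: {caption}"]
    if kept:
        lines.append("Text detected in image: " + "; ".join(kept))
    return "\n".join(lines)


# Made-up detections: one confident, one low-confidence noise item.
ocr = [
    ([[0, 0], [40, 0], [40, 10], [0, 10]], "Sales 2020", 0.93),
    ([[0, 20], [30, 20], [30, 30], [0, 30]], "l|I", 0.12),
]
context = visual_context_as_text("A bar chart with four bars.", ocr)
```

Discarding the bounding boxes is the simplest choice, and it also illustrates why a text rendering of an image tends to lose structural and spatial information.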

Model Inference

Generate a detailed response to the math question

Model or implementation: Various (e.g., GPT-4V, Bard, Claude-2)

Answer Extractor

Extract the final short answer from the model's verbose output

Model or implementation: GPT-4
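A minimal sketch of LLM-based answer extraction is shown below. The prompt template is invented for illustration and is not the paper's prompt, and `llm` is a hypothetical prompt-to-text callable standing in for whatever client wraps the extractor model.

```python
from typing import Callable

# Hypothetical template; the wording is an assumption, not taken from the paper.
EXTRACTION_TEMPLATE = (
    "Extract the final answer from the model response below. "
    "Reply with the short answer only.\n\n"
    "Question: {question}\n"
    "Model response: {response}\n"
    "Extracted answer:"
)


def extract_answer(question: str, response: str, llm: Callable[[str], str]) -> str:
    """Ask an extractor LLM for the short answer and trim surrounding whitespace."""
    prompt = EXTRACTION_TEMPLATE.format(question=question, response=response)
    return llm(prompt).strip()


# A stub extractor so the example runs without any API access.
stub_llm = lambda prompt: " 14 \n"
extracted = extract_answer(
    "What is the value of x?",
    "Solving 2x = 28 step by step gives x = 14.",
    stub_llm,
)
```

Passing the extractor as a callable keeps the extraction logic testable offline and makes it easy to swap the underlying model.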

Novel Architectural Elements

Introduction of three new dataset components (IQTest, FunctionQA, PaperQA) to fill gaps in logical and scientific visual reasoning [Benchmark contribution]
Automated answer extraction pipeline using GPT-4 to handle free-form LMM outputs [Evaluation methodology]

Modeling

Base Model: None proposed; the paper evaluates multiple existing models, including GPT-4V, Multimodal Bard, GPT-4 (text-only), Claude-2, and LLaVA.

Training Method: Paper evaluates pre-trained models; does not propose a new training method

Adaptation: None (Zero-shot and Few-shot inference only)

Trainable Parameters: None (Inference only)

Training Data: None (no training is performed). Benchmark data used for evaluation:

Total 6,141 examples
Sources: 28 existing datasets + 3 new datasets
Split: 1,000 testmini / 5,141 test

Key Hyperparameters:

temperature: 0 (for deterministic generation)
top_p: 1 (no nucleus truncation; determinism comes from temperature 0)
max_tokens: Not reported in the paper
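The reported decoding settings could be captured in a small configuration like the one below. The dictionary keys mirror common generation-parameter names, and `request_kwargs` is a hypothetical helper; no specific client library or endpoint is implied.

```python
# Decoding settings as reported in the summary above.
GENERATION_CONFIG = {
    "temperature": 0,  # reported value
    "top_p": 1,        # reported value
    # max_tokens is not reported in the paper, so it is deliberately left unset.
}


def request_kwargs(prompt: str, config: dict = GENERATION_CONFIG) -> dict:
    """Merge a prompt with the decoding settings (hypothetical helper)."""
    return {"prompt": prompt, **config}


kwargs = request_kwargs("Question: what is 2 + 2?")
```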

Compute: Not reported in the paper

Comparison to Prior Work

vs. ChartQA/Geometry3K: MathVista covers 7 math reasoning types and 5 tasks, whereas others are narrow/task-specific
vs. VQA v2: MathVista focuses on mathematical reasoning (logic, arithmetic, statistics) rather than natural scene understanding
vs. GSM-8K: MathVista involves visual contexts, whereas GSM-8K is text-only [not cited in paper]

Limitations

Evaluation of proprietary models (GPT-4V, Bard) relies on black-box APIs/Playgrounds which may change over time
Visual tool augmentation (captions/OCR) for text LLMs is imperfect, losing structural or spatial information
Automatic answer extraction via GPT-4, while accurate, may still introduce minor errors in free-form evaluation

Reproducibility

Code: https://mathvista.github.io

The project is publicly available (https://mathvista.github.io). The testmini dataset, evaluation scripts, and metadata are released. Test set labels are withheld for leaderboard integrity. GPT-4V and Bard are proprietary models evaluated via API/Playground.

📊 Experiments & Results

Evaluation Setup

Zero-shot and Few-shot (2-shot) generation with CoT/PoT prompting on the testmini subset
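To make the PoT side of this setup concrete: with Program-of-Thought prompting the model's response is a program, and the answer comes from running it. The sketch below assumes the program stores its result in a variable named `answer`; that convention, the helper name, and the example program are all illustrative assumptions.

```python
def run_program_of_thought(program: str, result_name: str = "answer") -> str:
    """Execute model-generated Python and return the value bound to `result_name`."""
    namespace: dict = {}
    # WARNING: exec on untrusted model output is unsafe outside a sandbox.
    exec(program, namespace)
    return str(namespace[result_name])


# A made-up PoT-style response for "What is the mean bar height?".
generated = (
    "bar_heights = [3, 4, 5]\n"
    "answer = sum(bar_heights) / len(bar_heights)\n"
)
pot_answer = run_program_of_thought(generated)
```

With Chain-of-Thought prompting, by contrast, the response is free text and the final answer has to be read out of the reasoning rather than computed by an interpreter.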

Benchmarks:

MathVista (Visual Mathematical Reasoning) [New]

Metrics:

Accuracy (Overall)
Accuracy (per task/reasoning type)
Statistical methodology: Not explicitly reported in the paper
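Overall and per-category accuracy can be computed as in the sketch below. It assumes each result carries a single category label and a correctness flag; this is a simplification for illustration, and the function name and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_report(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Take (category, is_correct) pairs; return accuracy in percent per key."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for category, is_correct in results:
        # Every item counts toward "overall" and toward its own category.
        for key in ("overall", category):
            totals[key] += 1
            hits[key] += int(is_correct)
    return {key: 100.0 * hits[key] / totals[key] for key in totals}


# Toy results, not benchmark data.
toy = [("geometry", True), ("geometry", False),
       ("statistics", True), ("statistics", True)]
report = accuracy_report(toy)
```

On the toy data, 3 of 4 items are correct overall, 1 of 2 in geometry, and 2 of 2 in statistics.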

Key Results

Comparative analysis of leading foundation models reveals GPT-4V's dominance and the limitations of open-source LMMs.

Benchmark	Metric	Baseline	This Paper	Δ
MathVista (testmini)	Accuracy (%)	34.8	49.9	+15.1
MathVista (testmini)	Accuracy (%)	60.3	49.9	-10.4
MathVista (testmini)	Accuracy (%)	29.2	33.9	+4.7
MathVista (testmini)	Accuracy (%)	17.9	26.1	+8.2
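The Δ column is the second accuracy minus the baseline, in percentage points. The short check below recomputes it from the four rows of the table; decimal arithmetic is used only to avoid floating-point noise in the differences.

```python
from decimal import Decimal

# (baseline, this paper) accuracy pairs copied from the table rows, in percent.
rows = [("34.8", "49.9"), ("60.3", "49.9"), ("29.2", "33.9"), ("17.9", "26.1")]

# Difference in percentage points for each row.
deltas = [Decimal(new) - Decimal(base) for base, new in rows]
delta_strings = [str(d) for d in deltas]
```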

Experiment Figures

Radar chart comparing accuracies of GPT-4V, Bard, PoT GPT-4, Random Chance, and Humans across different reasoning types and visual contexts

Error analysis of Multimodal Bard, categorizing failure modes in answers and explanations

Main Takeaways

GPT-4V significantly outperforms all other models, including proprietary ones like Bard and augmented text-only GPT-4
Open-source LMMs (e.g., LLaVA, IDEFICS) perform poorly, often barely beating random chance, due to limited mathematical and OCR capabilities
Augmenting text-only LLMs with captions and OCR improves performance (PoT GPT-4 reaches 33.9%) but is bottlenecked by the quality of visual descriptions
A gap of 10.4 percentage points remains between the best model (GPT-4V, 49.9%) and human performance (60.3%), particularly in tasks requiring complex figure understanding and rigorous reasoning

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs)
Familiarity with mathematical reasoning types (algebra, geometry, statistics)
Basic knowledge of prompting strategies (CoT, PoT)

Key Terms

LMM: Large Multimodal Model—a foundation model capable of processing and reasoning over both text and images (e.g., GPT-4V, Bard)

CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer

PoT: Program-of-Thought—a prompting strategy where the model generates executable code (e.g., Python) to solve the problem

OCR: Optical Character Recognition—technology to convert text within images into machine-readable text formats

VQA: Visual Question Answering—the task of answering a natural language question based on the content of an image

Hallucination: A phenomenon where a model generates plausible-sounding but factually incorrect information or detects objects/relationships not present in the input

FQA: Figure Question Answering—answering questions based on statistical plots and charts

MathQA: Math-targeted Question Answering—datasets specifically designed to test mathematical problem solving

GPS: Geometry Problem Solving—tasks involving reasoning about geometric shapes and diagrams

TQA: Textbook Question Answering—tasks derived from educational materials, often requiring domain knowledge