We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

📝 Paper Summary

Visual Mathematical Reasoning
Large Multimodal Models (LMMs) Evaluation

We-Math evaluates LMMs by decomposing complex visual math problems into sub-problems based on hierarchical knowledge concepts, revealing that models often rely on rote memorization rather than true reasoning.

Core Problem

Existing visual math benchmarks rely on result-oriented evaluation (final answer correctness), failing to reveal whether LMMs truly understand the underlying principles or merely exploit shortcuts.

Why it matters:

High end-to-end accuracy can mask deep reasoning flaws, such as 'Rote Memorization' where models answer complex problems correctly but fail the prerequisite sub-problems
Current benchmarks overlook the hierarchical nature of mathematical knowledge, where mastering a complex concept requires mastering its dependencies
Identifying specific failure modes (e.g., lack of knowledge vs. lack of generalization) is crucial for progressing toward human-like reasoning

Concrete Example: A model might correctly solve a multi-step geometry problem asking for a shaded area (Composite Problem) but fail to correctly calculate the area of the constituent triangle or circle (Sub-problems) when asked individually. This contradiction suggests the model guessed or memorized the final answer rather than reasoning through the steps.

Key Novelty

Knowledge-based Hierarchical Decomposition and Process Evaluation

Organizes 6.5K visual math problems under a tree of 67 knowledge concepts spanning 5 layers of granularity, mimicking human learning paths
Decomposes composite problems into atomic sub-problems to test if the model masters the necessary prerequisites before solving the complex task
Introduces a four-dimensional metric (Insufficient Knowledge, Inadequate Generalization, Complete Mastery, Rote Memorization) to diagnose *why* a model fails or succeeds

Architecture

The pipeline for Knowledge-based Data Decomposition and the decision logic for the four-dimensional metric (IK, IG, CM, RM)
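The decision logic for the four categories can be sketched as a small function. This is a minimal illustration that follows the category definitions given in this summary, not the authors' code: the name `categorize` and its arguments are hypothetical, and the sketch does not model the strict versus loose variants of the metric.

```python
from typing import Sequence


def categorize(sub_correct: Sequence[bool], main_correct: bool) -> str:
    """Map one decomposed problem to IK, IG, CM, or RM.

    sub_correct: one flag per sub-problem (True = answered correctly).
    main_correct: whether the composite problem was answered correctly.
    Assumption: any failed sub-problem counts as a failed prerequisite.
    """
    if not sub_correct:
        raise ValueError("expected at least one sub-problem result")
    all_subs_ok = all(sub_correct)
    if all_subs_ok and main_correct:
        return "CM"  # Complete Mastery
    if all_subs_ok:
        return "IG"  # Inadequate Generalization: prerequisites pass, composite fails
    if main_correct:
        return "RM"  # Rote Memorization: composite passes, a prerequisite fails
    return "IK"  # Insufficient Knowledge: a prerequisite and the composite both fail
```

For the shaded-area example, a model that gets the final area right but misses the circle sub-problem would be passed as `categorize([True, False], True)` and land in RM.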

Evaluation Highlights

GPT-4o achieves the highest 'Complete Mastery' but still struggles with generalization; its primary challenge has shifted from 'Insufficient Knowledge' to 'Inadequate Generalization'
Many LMMs exhibit high 'Rote Memorization' rates (e.g., G-LLaVA-13B has ~36% loose Rote Memorization), solving complex problems while failing their sub-steps
Knowledge Concept Augmentation (providing concept definitions) reduces 'Insufficient Knowledge' errors, confirming that lack of domain definitions is a bottleneck for smaller models

Breakthrough Assessment

8/10

Significant shift from outcome-based to process-based evaluation in visual math. The 'Rote Memorization' metric exposes a critical flaw in current LMM reasoning that standard accuracy metrics miss.

⚙️ Technical Details

Problem Definition

Setting: Visual mathematical reasoning where a model must solve a main problem Q_i given image I_i, while also solving decomposed sub-problems {q_i^m} corresponding to required knowledge concepts {k_i^m}

Inputs: Image I, Question Q, optionally Knowledge Concept K and Prerequisite Condition C

Outputs: Answer A for the main problem and answers {a^m} for sub-problems
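One way to hold a decomposed problem in memory is sketched below. The class and field names are assumptions chosen to mirror the symbols above (I, Q, A, q^m, k^m, a^m, C); the actual dataset schema may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubProblem:
    question: str  # q^m: the atomic sub-question
    concept: str   # k^m: the knowledge concept it tests
    answer: str    # a^m: the reference answer


@dataclass
class CompositeProblem:
    image_path: str  # I: path to the problem image (hypothetical field)
    question: str    # Q: the main composite question
    answer: str      # A: the reference answer to the main question
    sub_problems: List[SubProblem] = field(default_factory=list)
    prerequisite: Optional[str] = None  # C: optional prerequisite condition

    def concepts(self) -> List[str]:
        """Return the required knowledge concepts in sub-problem order."""
        return [sp.concept for sp in self.sub_problems]


# Illustrative record; the text and values are made up for the example.
example = CompositeProblem(
    image_path="shaded_area.png",
    question="What is the area of the shaded region?",
    answer="B",
    sub_problems=[
        SubProblem("What is the area of the triangle?", "triangle area", "A"),
        SubProblem("What is the area of the circle?", "circle area", "C"),
    ],
)
```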

Pipeline Flow

Problem Collection & Taxonomy (67 concepts, 5 layers)
Knowledge-based Decomposition (Human experts break composite problems into atomic sub-problems)
Model Inference (Run LMM on both composite and sub-problems)
Diagnostic Categorization (Classify results into IK, IG, CM, RM)

System Modules

Knowledge Decomposition

Break down complex problems into a sequence of sub-problems based on required knowledge concepts

Model or implementation: Human Expert Annotation

Evaluation Engine

Execute model on all problem variations and categorize reasoning behavior

Model or implementation: Evaluated LMM (e.g., GPT-4o, LLaVA)

Novel Architectural Elements

Four-dimensional diagnostic metric architecture (IK, IG, CM, RM) integrating performance on both atomic and composite tasks
Hierarchical dependency graph for visual math problems spanning 5 layers of granularity (e.g., discipline -> category -> problem type -> knowledge concept)
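A hierarchy of this kind can be represented as a nested mapping, with a lookup that recovers the chain of layers above a concept. The sketch below is illustrative only: every label in `TREE` is a placeholder rather than an entry from the benchmark's taxonomy, and `find_path` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Nested dict: each key is a node label, each value holds its children.
# All labels are illustrative placeholders, not the benchmark's taxonomy.
TREE: Dict[str, dict] = {
    "Geometry": {
        "Plane Figures": {
            "Area": {"Area of a Triangle": {}, "Area of a Circle": {}},
        },
    },
}


def find_path(tree: Dict[str, dict], target: str) -> Optional[List[str]]:
    """Depth-first search returning the labels from the root down to target."""
    for label, children in tree.items():
        if label == target:
            return [label]
        below = find_path(children, target)
        if below is not None:
            return [label] + below
    return None  # target does not occur in this subtree
```

Calling `find_path(TREE, "Area of a Circle")` walks down from the top layer and returns every ancestor label followed by the concept itself, which is the dependency chain a model would need to master.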

Modeling

Base Model: Various LMMs evaluated (GPT-4o, GPT-4V, Gemini 1.5 Pro, LLaVA-NeXT, etc.)

Reproducibility

Code: https://github.com/We-Math/We-Math

📊 Experiments & Results

Evaluation Setup

Zero-shot visual question answering on the We-Math dataset

Benchmarks:

We-Math (Visual Mathematical Reasoning) [New]

Metrics:

Accuracy (End-to-End)
CM (Complete Mastery) Rate
IK (Insufficient Knowledge) Rate
IG (Inadequate Generalization) Rate
RM (Rote Memorization) Rate
We-Math Score (Weighted metric)
Statistical methodology: Not explicitly reported in the paper
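Given one category label per decomposed problem, the four rates and a weighted aggregate could be computed roughly as follows. This is a sketch under assumptions: the summary does not give the normalization or the weights behind the We-Math Score, so `weighted_score` takes caller-supplied weights, and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

CATEGORIES = ("IK", "IG", "CM", "RM")


def category_rates(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of problems in each category, as percentages."""
    counts = Counter(labels)
    total = sum(counts[c] for c in CATEGORIES)  # labels outside CATEGORIES are ignored
    if total == 0:
        raise ValueError("no labelled problems")
    return {c: 100.0 * counts[c] / total for c in CATEGORIES}


def weighted_score(rates: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of category rates; missing weights default to zero.

    The weights are an assumption supplied by the caller, not values
    taken from the paper.
    """
    return sum(weights.get(c, 0.0) * rates[c] for c in CATEGORIES)
```

Keeping the weights as an argument makes it easy to compare a lenient scheme (partial credit for IG) against a strict one (credit for CM only) on the same labels.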

Key Results

Overall performance on We-Math shows GPT-4o leading significantly, with a clear gap between proprietary and open-source models.
Benchmark	Metric	Baseline	This Paper	Δ
We-Math	We-Math Score	51.3	61.7	+10.4
We-Math	We-Math Score	46.2	61.7	+15.5
The diagnostic metrics reveal different failure modes: GPT-4o suffers more from generalization issues (IG), while smaller models suffer from knowledge gaps (IK) and rote memorization (RM).
We-Math	IK (Insufficient Knowledge) Count	350	119	-231
We-Math	RM (Rote Memorization - Loose) Rate	12.8	35.8	+23.0
Knowledge Concept Augmentation (KCA) experiments show that providing definitions helps address Insufficient Knowledge (IK); the size of the IK reduction is not explicitly reported in the paper.

Experiment Figures

Radar charts comparing model performance across 5 math categories and a bar chart showing the distribution of IK, IG, CM, RM for various models

Main Takeaways

Negative correlation between number of reasoning steps and performance: models perform significantly worse on multi-step problems than one-step sub-problems
GPT-4o is the first model where the primary bottleneck shifts from 'Insufficient Knowledge' (not knowing the concept) to 'Inadequate Generalization' (knowing concepts but failing to combine them)
Open-source models like G-LLaVA exhibit high 'Rote Memorization', solving complex queries while failing basic ones, suggesting training data contamination or shortcut learning
Knowledge Concept Augmentation effectively mitigates 'Insufficient Knowledge' errors, indicating models often lack specific domain definitions rather than reasoning capacity per se

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Multimodal Models (LMMs)
Mathematical reasoning evaluation (Chain of Thought)
Hierarchical knowledge structures (dependency trees)

Key Terms

IK: Insufficient Knowledge—The model fails on sub-problems (basic concepts) and consequently fails the main composite problem

IG: Inadequate Generalization—The model solves all sub-problems correctly but fails to combine them to solve the main composite problem

CM: Complete Mastery—The model correctly solves both the sub-problems and the main composite problem

RM: Rote Memorization—The model fails sub-problems but 'correctly' answers the main problem, implying guessing or data leakage

KCA: Knowledge Concept Augmentation—A strategy of providing explicit textbook definitions/formulas of knowledge concepts to the model to aid reasoning
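In practice, KCA amounts to placing concept definitions in front of the question before it is sent to the model. The sketch below shows one way to do that; the prompt wording and the function name `build_kca_prompt` are assumptions, since this summary does not reproduce the template used in the paper.

```python
from typing import Sequence


def build_kca_prompt(question: str, concept_notes: Sequence[str]) -> str:
    """Prepend concept definitions to a question (hypothetical template)."""
    if not concept_notes:
        return question  # nothing to augment with, so pass the question through
    notes = "\n".join(f"- {note}" for note in concept_notes)
    return f"Relevant knowledge concepts:\n{notes}\n\nQuestion: {question}"


# Illustrative call; the definition text is made up for the example.
prompt = build_kca_prompt(
    "What is the area of the circle?",
    ["Area of a circle: pi times the radius squared"],
)
```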

LMM: Large Multimodal Model—AI models capable of processing and reasoning over both text and visual inputs

COT: Chain of Thought—prompting technique encouraging models to generate intermediate reasoning steps

Multi-step problem: A math problem requiring the application of multiple distinct knowledge concepts (e.g., area formula + subtraction)

One-step problem: A math problem testing a single atomic knowledge concept