← Back to Paper List

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma Gongque, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang
Beijing University of Posts and Telecommunications, Wechat, Tencent Inc., Huazhong University of Science and Technology, Beijing Institute of Technology
arXiv.org (2024)
MM Benchmark Reasoning

📝 Paper Summary

Visual Mathematical Reasoning Large Multimodal Models (LMMs) Evaluation
We-Math evaluates LMMs by decomposing complex visual math problems into sub-problems based on hierarchical knowledge concepts, revealing that models often rely on rote memorization rather than true reasoning.
Core Problem
Existing visual math benchmarks rely on result-oriented evaluation (final answer correctness), failing to reveal whether LMMs truly understand the underlying principles or merely exploit shortcuts.
Why it matters:
  • High end-to-end accuracy can mask deep reasoning flaws, such as 'Rote Memorization' where models answer complex problems correctly but fail the prerequisite sub-problems
  • Current benchmarks overlook the hierarchical nature of mathematical knowledge, where mastering a complex concept requires mastering its dependencies
  • Identifying specific failure modes (e.g., lack of knowledge vs. lack of generalization) is crucial for progressing toward human-like reasoning
Concrete Example: A model might correctly solve a multi-step geometry problem asking for a shaded area (Composite Problem) but fail to correctly calculate the area of the constituent triangle or circle (Sub-problems) when asked individually. This contradiction suggests the model guessed or memorized the final answer rather than reasoning through the steps.
Key Novelty
Knowledge-based Hierarchical Decomposition and Process Evaluation
  • Deconstructs 6.5K visual math problems into a tree of 67 knowledge concepts and 5 granularity layers, mimicking human learning paths
  • Decomposes composite problems into atomic sub-problems to test if the model masters the necessary prerequisites before solving the complex task
  • Introduces a four-dimensional metric (Insufficient Knowledge, Inadequate Generalization, Complete Mastery, Rote Memorization) to diagnose *why* a model fails or succeeds
Architecture
Architecture Figure Figure 3
The pipeline for Knowledge-based Data Decomposition and the decision logic for the four-dimensional metric (IK, IG, CM, RM)
Evaluation Highlights
  • GPT-4o achieves the highest 'Complete Mastery' but still struggles with generalization; its primary challenge has shifted from 'Insufficient Knowledge' to 'Inadequate Generalization'
  • Many LMMs exhibit high 'Rote Memorization' rates (e.g., G-LLaVA-13B has ~36% loose Rote Memorization), solving complex problems while failing their sub-steps
  • Knowledge Concept Augmentation (providing concept definitions) reduces 'Insufficient Knowledge' errors, confirming that lack of domain definitions is a bottleneck for smaller models
Breakthrough Assessment
8/10
Significant shift from outcome-based to process-based evaluation in visual math. The 'Rote Memorization' metric exposes a critical flaw in current LMM reasoning that standard accuracy metrics miss.
×