
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wen-gang Zhou, A. Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu
University of Science and Technology of China, Xi’an Jiaotong University, Shanghai Artificial Intelligence Laboratory, SenseTime Research, Tsinghua University
arXiv.org (2025)
MM Benchmark Reasoning RL

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Visual Reasoning Evaluation
VisuLogic is a benchmark of 1,000 visual puzzles designed to be difficult to caption. On it, state-of-the-art multimodal models score only slightly above random chance on tasks requiring genuine visual logic.
Core Problem
Current MLLM benchmarks let models bypass visual reasoning by relying on text recognition or captions, so they fail to test whether a model can deduce logical relationships between visual elements.
Why it matters:
  • Models achieving high scores on existing benchmarks (like MathVista) often do so by converting images to text and using LLM priors, masking deep deficits in visual cognition
  • Genuine visual reasoning (understanding spatial transformations, attribute shifts, and stylistic patterns) is critical for AGI but remains poorly measured
  • Leading models such as GPT-4o score only about 26% on this benchmark, barely above the 24.9% random baseline, indicating a significant blind spot in current multimodal capabilities
Concrete Example: In a 'Quantitative Reasoning' puzzle showing a grid of dots that change in count and color, a text-only LLM (fed a caption) fails because the caption misses the subtle arithmetic progression. An MLLM might recognize the dots but still fail to deduce the 'add one black dot' rule, and end up guessing randomly.
Key Novelty
Hard-to-Caption Visual Logic Benchmark
  • Constructs problems where the solution depends on visual relationships (e.g., rotation, superposition, intersection) that are inherently difficult to describe in text, blocking language-based shortcuts
  • Provides a taxonomy of six distinct reasoning types (e.g., Stylistic, Attribute, Positional) to diagnose specific visual-cognitive failures rather than general VQA performance
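A per-category breakdown of this kind can be sketched in a few lines. The snippet below is a minimal illustration, not the benchmark's own evaluation code: the record layout `(category, predicted, gold)` and the function name `accuracy_by_category` are assumptions made for the example, and only the category names mentioned in this summary are used.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return accuracy per reasoning category.

    `records` is an iterable of (category, predicted, gold) tuples.
    This layout is hypothetical, chosen only for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        if predicted == gold:
            correct[category] += 1
    # Divide per category so a weak category is not hidden by a strong one.
    return {category: correct[category] / total[category] for category in total}


# Toy answers, invented for the example (not benchmark data).
toy_records = [
    ("Positional", "A", "A"),
    ("Positional", "B", "C"),
    ("Attribute", "D", "D"),
    ("Stylistic", "A", "B"),
]
per_category = accuracy_by_category(toy_records)
```

Reporting one number per category, rather than a single overall accuracy, is what lets a taxonomy like this point to the specific kind of visual reasoning a model fails at.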
Evaluation Highlights
  • State-of-the-art MLLMs achieve near-random performance: GPT-4o (26.3%) and Gemini-2.0-Pro (28.0%) barely exceed the 24.9% random baseline
  • Human performance (51.4%) is roughly 1.8 times the 28.0% scored by Gemini-2.0-Pro, a gap of more than 23 percentage points, highlighting a large shortfall in visual reasoning capabilities
  • Reinforcement Learning (RL) fine-tuning on supplementary data boosts InternVL2.5-38B from 25.5% to 31.1% (a 5.6-point gain), setting a new state of the art on the benchmark
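The arithmetic behind these comparisons is simple to check. The sketch below uses only the percentages quoted in this summary; the variable and function names are illustrative choices, not anything taken from the paper.

```python
# Accuracy figures (percent) as quoted in this summary.
RANDOM_BASELINE = 24.9
HUMAN = 51.4
SCORES = {
    "GPT-4o": 26.3,
    "Gemini-2.0-Pro": 28.0,
    "InternVL2.5-38B": 25.5,
    "InternVL2.5-38B + RL": 31.1,
}


def margin_over_random(score, baseline=RANDOM_BASELINE):
    """Percentage points by which a score exceeds the random baseline."""
    return round(score - baseline, 1)


# Margin of each model over random guessing, in percentage points.
margins = {name: margin_over_random(score) for name, score in SCORES.items()}

# Gain from RL fine-tuning, in percentage points.
rl_gain = round(SCORES["InternVL2.5-38B + RL"] - SCORES["InternVL2.5-38B"], 1)

# Human accuracy relative to Gemini-2.0-Pro and to the RL-tuned model.
human_vs_gemini = round(HUMAN / SCORES["Gemini-2.0-Pro"], 2)
human_vs_rl = round(HUMAN / SCORES["InternVL2.5-38B + RL"], 2)
```

On these numbers GPT-4o sits 1.4 points above random and Gemini-2.0-Pro 3.1 points above, while the RL-tuned model gains 5.6 points yet humans still score about 1.65 times higher.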
Breakthrough Assessment
8/10
Exposes a critical weakness in current SOTA models (near-random performance) and provides a rigorous benchmark, along with training data, to address it. The gap between human and model performance is striking.