VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Visual Reasoning Evaluation

VisuLogic is a benchmark of 1,000 visual puzzles designed to be difficult to caption, revealing that state-of-the-art multimodal models perform near random chance on tasks requiring genuine visual logic.

Core Problem

Current MLLM benchmarks let models bypass visual reasoning by relying on text recognition or captions, so they do not test whether a model can deduce logical relationships between visual elements.

Why it matters:

Models achieving high scores on existing benchmarks (like MathVista) often do so by converting images to text and using LLM priors, masking deep deficits in visual cognition
Genuine visual reasoning (understanding spatial transformations, attribute shifts, and stylistic patterns) is critical for AGI but remains poorly measured
Leading models like GPT-4o score exceptionally low (~26%) on this benchmark, indicating a significant blind spot in current multimodal capabilities

Concrete Example: In a 'Quantitative Reasoning' puzzle showing a grid of dots that change in count and color, a text-only LLM fed a caption fails because the caption misses the subtle arithmetic progression. An MLLM may recognize the dots but still fail to deduce the 'add one black dot' rule, and ends up guessing.

Key Novelty

Hard-to-Caption Visual Logic Benchmark

Constructs problems where the solution depends on visual relationships (e.g., rotation, superposition, intersection) that are inherently difficult to describe in text, blocking language-based shortcuts
Provides a taxonomy of six distinct reasoning types (e.g., Stylistic, Attribute, Positional) to diagnose specific visual-cognitive failures rather than general VQA performance

Evaluation Highlights

State-of-the-art MLLMs achieve near-random performance: GPT-4o (26.3%) and Gemini-2.0-Pro (28.0%) barely exceed the 24.9% random baseline
Human performance (51.4%) is nearly double that of the best off-the-shelf models (about 28%), a gap of more than 23 percentage points in visual reasoning accuracy
Reinforcement Learning (RL) fine-tuning on supplementary data boosts InternVL2.5-38B from 25.5% to 31.1%, setting a new state-of-the-art

Breakthrough Assessment

8/10

Exposes a critical weakness in current SOTA models (near-random performance) and provides both a rigorous benchmark and training data to address it. The gap between human and model performance is striking.

⚙️ Technical Details

Problem Definition

Setting: Multiple-choice visual question answering requiring logical deduction

Inputs: An image I containing a visual puzzle and a question Q

Outputs: A single answer choice selected from the options {A, B, C, D}
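
A minimal sketch of this setting, assuming each benchmark item is stored as an image path, a question string, and a gold option letter. The `PuzzleItem` class, its field names, and the `accuracy` helper are hypothetical and are not taken from the VisuLogic release.

```python
from dataclasses import dataclass
from typing import Sequence

OPTIONS = ("A", "B", "C", "D")  # the four answer options named in the problem definition


@dataclass(frozen=True)
class PuzzleItem:
    """One multiple-choice visual puzzle (hypothetical schema)."""
    image_path: str  # path to the puzzle image I
    question: str    # question text Q
    answer: str      # gold option letter, one of OPTIONS

    def __post_init__(self) -> None:
        if self.answer not in OPTIONS:
            raise ValueError(f"answer must be one of {OPTIONS}, got {self.answer!r}")


def accuracy(items: Sequence[PuzzleItem], predictions: Sequence[str]) -> float:
    """Return the percentage of predictions that match the gold option."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(item.answer == pred for item, pred in zip(items, predictions))
    return 100.0 * correct / len(items)
```

Validating the gold letter at construction time keeps malformed items out of the accuracy denominator.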

Modeling

Base Model: InternVL2.5-38B (for the RL baseline experiment)

Training Method: Rule-based Reinforcement Learning (RL)

Objective Functions:

Purpose: Optimize the model to generate correct reasoning paths and answers.

Formally: Not explicitly detailed in the main text summary, but described as a 'simple reinforcement-learning (RL) fine-tuning step'.
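
Since the reward is not specified here, the following is only an illustration of what a rule-based reward for multiple-choice answers commonly looks like: a fixed rule checks the final answer letter, with no learned reward model. It assumes the model is prompted to end its response with "Answer: <letter>". That output convention, the regular expression, and both function names are assumptions, not the paper's implementation.

```python
import re
from typing import Optional

# Assumed output convention: the response states its choice as e.g. "Answer: C".
# The trailing \b stops words such as "apple" from matching as option A.
_ANSWER_RE = re.compile(r"Answer:\s*\(?([ABCD])\b", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated after 'Answer:', or None if absent."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def rule_based_reward(response: str, gold: str) -> float:
    """Return 1.0 if the extracted choice equals the gold letter, else 0.0."""
    choice = extract_choice(response)
    return 1.0 if choice is not None and choice == gold.upper() else 0.0
```

Taking the last match rewards the answer the model finally commits to, even if it revises an earlier guess during its reasoning.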

Training Data:

Supplementary training set of 4,296 question-answer pairs
Drawn from domains similar to the benchmark's, with no overlapping questions
Split mirrors the benchmark taxonomy (Quantitative, Spatial, etc.)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MathVista: VisuLogic focuses on pure visual logic (patterns, shapes) rather than math problems that can often be solved via text translation
vs. MMMU: Focuses on fundamental visual cognitive skills (spatial, attribute) rather than domain-specific knowledge (chemistry, physics)
vs. LogicVista: VisuLogic requires multi-step visual analysis (e.g., 3D reconstruction, stylistic changes) rather than single-hop superficial queries

Limitations

Current models, including the RL baseline, still perform significantly below human level (31.1% vs 51.4%)
The random baseline is about 25% (24.9% as reported), so even the RL-tuned model at 31.1% is only about 6 points better than guessing
CoT prompting yields minimal improvement, suggesting current text-based CoT training does not transfer well to visual logic
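
To put the margin over guessing in context, each score can be compared with the sampling noise of pure guessing. The sketch below assumes 1,000 independent four-option questions, so chance accuracy is 25% with a binomial standard error. This statistical model and the helper names are assumptions for illustration, not an analysis reported in the paper.

```python
import math

N_QUESTIONS = 1000  # benchmark size stated in the summary
CHANCE = 0.25       # theoretical chance level for four options (assumption)


def chance_standard_error(n: int = N_QUESTIONS, p: float = CHANCE) -> float:
    """Binomial standard error of accuracy (as a fraction) under random guessing."""
    return math.sqrt(p * (1.0 - p) / n)


def z_above_chance(accuracy_pct: float, n: int = N_QUESTIONS, p: float = CHANCE) -> float:
    """Number of standard errors by which a score exceeds chance."""
    return (accuracy_pct / 100.0 - p) / chance_standard_error(n, p)


# Scores quoted in this summary
for name, score in [("GPT-4o", 26.3), ("Gemini-2.0-Pro", 28.0),
                    ("InternVL2.5-38B + RL", 31.1), ("Human", 51.4)]:
    print(f"{name}: {score:.1f}% is {z_above_chance(score):.1f} standard errors above chance")
```

Under these assumptions the standard error is about 1.4 percentage points, so GPT-4o's 26.3% lies within one standard error of chance, while the human score is far outside it.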

Reproducibility

Code: https://visulogic-benchmark.github.io/VisuLogic

Benchmark data (1,000 questions), supplementary training data (4,296 questions), and baseline code are publicly available at https://visulogic-benchmark.github.io/VisuLogic. Hyperparameters for the RL baseline are presumably given in the paper's appendix, but this is not confirmed here.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation of pre-trained MLLMs and LLMs on 1,000 visual reasoning questions

Benchmarks:

VisuLogic (Visual Logical Reasoning) [New]

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Overall performance comparison showing that both LLMs (with captions) and MLLMs perform poorly, barely exceeding random chance.
VisuLogic	Accuracy	24.9	28.1	+3.2
VisuLogic	Accuracy	51.4	28.1	-23.3
VisuLogic	Accuracy	28.0	26.3	-1.7
Impact of Reinforcement Learning (RL) on visual reasoning performance.
VisuLogic	Accuracy	25.5	31.1	+5.6
VisuLogic	Accuracy	26.0	28.0	+2.0
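
The Δ column is the "This Paper" value minus the "Baseline" value, in percentage points. As a quick consistency check, the sketch below recomputes it for the five rows above. The tuples are copied from the table, and the `delta` helper is a hypothetical name.

```python
# (baseline, this_paper, reported_delta) copied from the five table rows
ROWS = [
    (24.9, 28.1, 3.2),
    (51.4, 28.1, -23.3),
    (28.0, 26.3, -1.7),
    (25.5, 31.1, 5.6),
    (26.0, 28.0, 2.0),
]


def delta(baseline: float, this_paper: float) -> float:
    """Difference in percentage points, rounded to one decimal like the table."""
    return round(this_paper - baseline, 1)


# Every reported delta should equal the recomputed difference.
for baseline, this_paper, reported in ROWS:
    assert delta(baseline, this_paper) == reported, (baseline, this_paper, reported)
```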

Experiment Figures

Radar charts comparing error rates of LLMs, MLLMs, and Humans across the six reasoning categories (Quantitative, Spatial, Positional, Attribute, Stylistic, Other)

Qualitative examples of success and failure cases for different models (LLM vs MLLM vs RL-model)

Main Takeaways

Text-only reasoning is insufficient: LLMs fed detailed captions perform near random chance, indicating that the benchmark requires genuine visual processing
The 'Visual Reasoning Gap': Current SOTA MLLMs (GPT-4o, Gemini) score within about 3 points of random guessing on deep visual logic tasks
Reinforcement Learning is a promising path: RL fine-tuning yields the largest gains reported (+5.6 points for InternVL2.5-38B), suggesting models can learn multi-step visual deduction strategies
Stylistic Reasoning is the hardest category: Models show error rates above 75%, meaning accuracy below the 25% expected from random guessing, on questions involving stylistic changes like overlays and contours
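
Per-category error rates such as the Stylistic figure above can be computed by grouping results by reasoning type. The sketch below assumes each result is a (category, is_correct) pair. The record format, function name, and toy data are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_rate_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each reasoning category to its error rate in percent."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        if not is_correct:
            errors[category] += 1
    return {cat: 100.0 * errors[cat] / totals[cat] for cat in totals}


# Toy records for illustration only; these are not results from the paper.
toy = [
    ("Stylistic", False), ("Stylistic", False), ("Stylistic", False), ("Stylistic", True),
    ("Spatial", True), ("Spatial", False),
]
rates = error_rate_by_category(toy)
```

With four options, an error rate above 75% corresponds to accuracy below the 25% expected from uniform guessing.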

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (MLLMs)
Visual Question Answering (VQA)
Reinforcement Learning (RL)

Key Terms

MLLM: Multimodal Large Language Model—an AI system capable of processing and reasoning over both text and images

CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer

RL: Reinforcement Learning—a training method where models learn by receiving feedback (rewards) for their actions

VisuLogic: The proposed benchmark dataset focusing on visual logical reasoning tasks

SFT: Supervised Fine-Tuning—training a model on a labeled dataset of inputs and correct outputs