VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs

📝 Paper Summary

Visual Reasoning, Vision-Language Model Evaluation, Diagnostic Benchmarking

VRIQ demonstrates that Vision-Language Models fail visual IQ tasks primarily due to perceptual bottlenecks (extracting visual facts) rather than reasoning deficits, even when using tool-augmented inference.

Core Problem

Current benchmarks evaluate perception and reasoning as a monolithic capability, making it impossible to determine if models fail because they cannot 'see' the visual elements or because they cannot 'think' through the logic.

Why it matters:

High-stakes applications that depend on visual reasoning, such as medical diagnosis, are unreliable if models hallucinate basic visual facts
Existing benchmarks focus either on shallow natural image QA or abstract puzzles without real-world grounding, lacking a bridge between the two
Prior evaluations fail to pinpoint whether improvements should target the visual encoder (perception) or the language planner (reasoning)

Concrete Example: In a matrix reasoning puzzle, a model might fail to predict the next shape. Without diagnostic probes, it is unclear if the model failed to deduce the 'rotate 45 degrees' rule (reasoning error) or simply misidentified the initial shape as a square instead of a triangle (perception error).

Key Novelty

Parallel Domain Diagnostic Benchmarking

Constructs parallel sets of Abstract (geometric) and Natural (real-world) puzzles that share identical logical structures, allowing direct comparison across visual domains
Introduces a hierarchical probing framework: if a model fails a puzzle, it is tested on 'Perceptual probes' (visual fact-checking) and 'Reasoning probes' (text-only logic) to isolate the root cause

Evaluation Highlights

Average performance on abstract puzzles is near random (~28%), significantly lower than natural image tasks (~45%)
56% of model failures are attributed to Perception-only deficits (model knows the logic but misses visual facts)
Only 1% of failures are Reasoning-only (model sees correctly but fails the logic), which limits how much reasoning-focused improvements such as Chain-of-Thought can help

Breakthrough Assessment

8/10

Provides a critical diagnostic reality check for VLMs, debunking the assumption that 'reasoning' is the primary bottleneck and rigorously quantifying the perception gap.

⚙️ Technical Details

Problem Definition

Setting: Multi-choice visual question answering in the format of IQ tests

Inputs: An image I containing a puzzle and a question q

Outputs: A selected option from candidate answers
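The setting above can be pictured as a small data structure plus an accuracy function. This is a minimal sketch under stated assumptions, not the paper's code: the `PuzzleItem` schema, its field names, and the `end_to_end_accuracy` helper are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class PuzzleItem:
    """One multiple-choice visual IQ item (hypothetical schema)."""
    image_path: str            # the image I containing the puzzle
    question: str              # the question q
    options: Tuple[str, ...]   # candidate answers
    answer_index: int          # index of the correct option


def end_to_end_accuracy(items: Sequence[PuzzleItem],
                        predictions: Sequence[int]) -> float:
    """Fraction of items whose predicted option index matches the key."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(pred == item.answer_index
                  for item, pred in zip(items, predictions))
    return correct / len(items)
```

Representing the output as an option index rather than free text keeps scoring unambiguous, which is one reason multiple-choice formats are convenient for this kind of evaluation.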

Pipeline Flow

Tier 1: End-to-End Evaluation
Tier 2: Diagnostic Probing (if Tier 1 fails)
Tier 3: Error Categorization
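The three tiers can be sketched as a single function. This is an illustrative sketch, not the authors' implementation: the callables `solve`, `run_p_probes`, and `run_r_probes` are hypothetical stand-ins that each return True on success, and the `UNATTRIBUTED` label for failures where both probe sets pass is an assumption, since the summary only names the P-only, R-only, and P+R categories.

```python
from enum import Enum
from typing import Callable, Optional


class ErrorCategory(Enum):
    P_ONLY = "perception-only"        # logic fine, visual facts missed
    R_ONLY = "reasoning-only"         # visual facts fine, logic failed
    P_AND_R = "perception+reasoning"  # both probe sets failed
    UNATTRIBUTED = "unattributed"     # assumption: both probe sets passed


def diagnose(solve: Callable[[], bool],
             run_p_probes: Callable[[], bool],
             run_r_probes: Callable[[], bool]) -> Optional[ErrorCategory]:
    """Return None if the puzzle is solved, else the failure category."""
    # Tier 1: end-to-end evaluation. Probes are skipped on success.
    if solve():
        return None

    # Tier 2: diagnostic probing, triggered only by a Tier 1 failure.
    perception_ok = run_p_probes()
    reasoning_ok = run_r_probes()

    # Tier 3: error categorization from the two probe outcomes.
    if not perception_ok and reasoning_ok:
        return ErrorCategory.P_ONLY
    if perception_ok and not reasoning_ok:
        return ErrorCategory.R_ONLY
    if not perception_ok and not reasoning_ok:
        return ErrorCategory.P_AND_R
    return ErrorCategory.UNATTRIBUTED
```

Passing the probes as callables makes the "only upon failure" behaviour explicit: no probe query is issued for a puzzle the model already solves.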

System Modules

End-to-End Solver

Attempt to solve the standard IQ puzzle (Abstract or Natural)

Model or implementation: Various VLMs (e.g., GPT-4o, Qwen2.5-VL)

Perceptual Probe (P-probe) (Diagnosis)

Test extraction of specific visual facts required for the puzzle

Model or implementation: Same VLM as Solver

Reasoning Probe (R-probe) (Diagnosis)

Test application of logical rules given explicit text facts

Model or implementation: Same VLM as Solver
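To make the two probe types concrete, the sketch below shows one possible representation. The class names, fields, and prompt layout are hypothetical and not taken from the paper; the example facts reuse the rotation scenario from the Core Problem section.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PerceptualProbe:
    """Atomic question about one visual attribute, asked with the image."""
    image_path: str
    question: str   # e.g. "What is the first shape?"
    expected: str


@dataclass(frozen=True)
class ReasoningProbe:
    """Text-only question: the image is replaced by explicit facts."""
    facts: Tuple[str, ...]
    question: str
    expected: str

    def to_prompt(self) -> str:
        # Hypothetical prompt layout: a fact list followed by the question.
        lines = ["Facts:"]
        lines += [f"- {fact}" for fact in self.facts]
        lines += ["", f"Question: {self.question}"]
        return "\n".join(lines)


r_probe = ReasoningProbe(
    facts=("The first shape is a triangle.",
           "Each step rotates the shape by 45 degrees."),
    question="By how many degrees is the shape rotated after two steps?",
    expected="90",
)
```

Because the reasoning probe carries no image, a wrong answer to it cannot be blamed on perception, which is what lets the two probe types separate the error sources.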

Novel Architectural Elements

Hierarchical evaluation architecture that triggers P-probes and R-probes only upon failure of the main task to categorize error sources

Modeling

Base Model: Diverse set including Qwen2.5-VL (3B/7B), GPT-4o, GPT-5.1, OpenAI o3

Training Method: Evaluation only (Pre-trained models)

Key Hyperparameters:

temperature: 0
max_tool_calls: 10
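How a tool-call cap of this kind typically bounds inference can be sketched as follows. Only the two values (`temperature` 0 and `max_tool_calls` 10) come from the list above; the `InferenceConfig` class, the `step` callback, and the loop are assumptions for illustration and do not describe the paper's harness.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class InferenceConfig:
    temperature: float = 0.0   # value listed above (deterministic decoding)
    max_tool_calls: int = 10   # value listed above


def run_with_tool_budget(step: Callable[[int], Optional[str]],
                         config: InferenceConfig) -> Optional[str]:
    """Invoke `step` until it yields an answer or the budget is spent.

    `step(i)` is a hypothetical callback that performs the i-th tool call
    and returns a final answer, or None if another tool call is wanted.
    """
    for call_index in range(config.max_tool_calls):
        answer = step(call_index)
        if answer is not None:
            return answer
    return None  # budget exhausted without a final answer
```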

Compute: Not reported in the paper

Comparison to Prior Work

vs. MMIQ: VRIQ introduces parallel abstract/natural domains and a hierarchical probing framework for error attribution
vs. RAVEN: VRIQ includes manually curated/validated items and natural image equivalents
vs. MathVista: VRIQ focuses specifically on psychometric categories (Sequence, Matrix, Rotation) rather than general math

Limitations

R-probes provide text descriptions, which might be easier to process than visual embeddings, potentially inflating the estimated 'reasoning' capability
Possible training data contamination for Abstract puzzles adapted from public exams (mitigated by modification)
Evaluation relies on the models' ability to follow instructions for the probes themselves

📊 Experiments & Results

Evaluation Setup

Zero-shot visual question answering on IQ puzzles

Benchmarks:

VRIQ (Visual IQ Test (Abstract & Natural)) [New]

Metrics:

End-to-End Accuracy
P-only Failure Rate (Perception Bottleneck)
R-only Failure Rate (Reasoning Deficit)
P+R Failure Rate (Combined Failure)

Statistical methodology: 95% confidence intervals reported for human performance baseline
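The failure-rate metrics listed above are shares of failed items, not of all items. A minimal sketch of that aggregation follows; the function name and the string labels are hypothetical, and the example input is toy data, not results from the paper.

```python
from collections import Counter
from typing import Dict, Iterable


def failure_shares(categories: Iterable[str]) -> Dict[str, float]:
    """Percentage of failed items per category.

    `categories` holds one label per failed item (solved items are
    excluded before calling this function).
    """
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy input: four failed items, three of them perception-only.
toy_shares = failure_shares(["P-only", "P-only", "P-only", "R-only"])
```

Normalizing by the number of failures rather than the number of items is what lets the shares be compared across models with different end-to-end accuracy.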

Key Results

Overall performance of VLMs (averaged across models) shows significant deficits compared to human capability and reveals perception as the primary bottleneck.

Benchmark	Metric	Baseline	This Paper	Δ
VRIQ (Abstract)	Average Accuracy (%)	25.00	28.00	+3.00
VRIQ (Natural)	Average Accuracy (%)	28.00	45.00	+17.00
VRIQ (Aggregate Failures)	Share of Failures, Perception Only (%)	0.00	56.00	+56.00
VRIQ (Aggregate Failures)	Share of Failures, Reasoning Only (%)	0.00	1.00	+1.00

Experiment Figures

Paired examples of Abstract and Natural puzzles across the five reasoning categories (Sequence, Matrix, Odd One Out, Rotation, 3D)

Main Takeaways

Perception is the dominant failure mode: VLMs struggle to count, identify shapes, and determine 3D orientation reliably, which prevents them from even attempting the reasoning step.
Tool-augmented reasoning (OpenAI o3) provides only modest improvements, suggesting that current tools do not sufficiently bridge the raw perception gap.
Models perform significantly better on Natural images than Abstract ones, likely due to the dominance of natural images in pre-training data.
Fine-grained analysis reveals specific perception categories (e.g., 3D/depth, rotation) cause more failures than simple color/shape identification.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) and their typical failure modes
Familiarity with IQ test structures (Matrix reasoning, Odd-one-out)
Basic concepts of diagnostic probing in ML evaluation

Key Terms

VLM: Vision-Language Model—an AI model trained to understand and generate text based on visual inputs

P-probes: Perceptual probes—atomic questions testing one task-relevant visual attribute (e.g., 'How many red circles?') to verify if the model 'sees' the necessary facts

R-probes: Reasoning probes—text-only questions asking the model to apply a logical rule given explicit facts, testing logic without visual noise

Abstract puzzles: IQ tasks using geometric primitives (shapes, lines) and formal patterns

Natural puzzles: IQ tasks using real-world objects and scenes while maintaining the same logical category as abstract puzzles

SFT: Supervised Fine-Tuning—training a model on labeled examples to improve performance

Chain-of-Thought: A reasoning technique where the model generates intermediate steps before the final answer

Tinkercad: A web-based CAD platform used here to generate consistent 3D visualization puzzles