
VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs

Tina Khezresmaeilzadeh, Jike Zhong, Konstantinos Psounis
arXiv (2026)
MM, Benchmark, Reasoning, Agent

📝 Paper Summary

Visual Reasoning, Vision-Language Model Evaluation, Diagnostic Benchmarking
VRIQ demonstrates that Vision-Language Models (VLMs) fail visual IQ tasks primarily because of perceptual bottlenecks (failing to extract visual facts) rather than reasoning deficits, even when they use tool-augmented inference.
Core Problem
Current benchmarks evaluate perception and reasoning as a single monolithic capability, making it impossible to determine whether models fail because they cannot 'see' the visual elements or because they cannot 'think' through the logic.
Why it matters:
  • High-stakes applications that require visual reasoning, such as medical diagnosis, are unreliable if models hallucinate basic visual facts
  • Existing benchmarks focus either on shallow natural image QA or abstract puzzles without real-world grounding, lacking a bridge between the two
  • Prior evaluations fail to pinpoint whether improvements should target the visual encoder (perception) or the language planner (reasoning)
Concrete Example: In a matrix reasoning puzzle, a model might fail to predict the next shape. Without diagnostic probes, it is unclear if the model failed to deduce the 'rotate 45 degrees' rule (reasoning error) or simply misidentified the initial shape as a square instead of a triangle (perception error).
Key Novelty
Parallel Domain Diagnostic Benchmarking
  • Constructs parallel sets of Abstract (geometric) and Natural (real-world) puzzles that share identical logical structures, allowing direct comparison across visual domains
  • Introduces a hierarchical probing framework: if a model fails a puzzle, it is tested on 'Perceptual probes' (visual fact-checking) and 'Reasoning probes' (text-only logic) to isolate the root cause
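The probing logic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `ProbeResult`, `FailureType`, and `diagnose` are hypothetical, and the rule that a probe family counts as passed only when every probe in it is answered correctly is an assumption. The summary names only the perception-only and reasoning-only categories; the other two outcomes are added here so the function covers every case.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    # Categories named in the summary
    PERCEPTION_ONLY = "perception-only"
    REASONING_ONLY = "reasoning-only"
    # Assumed extra categories so every probe outcome has a label
    BOTH = "perception and reasoning"
    NEITHER = "no probe failed"


@dataclass
class ProbeResult:
    """Outcome of the follow-up probes for one failed puzzle (hypothetical type)."""
    perception_ok: bool  # assumed: True only if all visual fact-check probes pass
    reasoning_ok: bool   # assumed: True only if all text-only logic probes pass


def diagnose(result: ProbeResult) -> FailureType:
    """Map probe outcomes for a failed puzzle to a failure category."""
    if not result.perception_ok and result.reasoning_ok:
        # Logic is fine on text, but visual facts were misread
        return FailureType.PERCEPTION_ONLY
    if result.perception_ok and not result.reasoning_ok:
        # Visual facts were read correctly, but the logic failed
        return FailureType.REASONING_ONLY
    if not result.perception_ok and not result.reasoning_ok:
        return FailureType.BOTH
    return FailureType.NEITHER


# Example: the model misreads a triangle as a square but solves the text-only rule
label = diagnose(ProbeResult(perception_ok=False, reasoning_ok=True))
print(label.value)
```

Keeping the two probe outcomes as separate booleans mirrors the idea of isolating the root cause: the puzzle result alone cannot distinguish the four cases.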
Evaluation Highlights
  • Average performance on abstract puzzles is near random (~28%), significantly lower than on natural-image tasks (~45%)
  • 56% of model failures are attributed to Perception-only deficits (model knows the logic but misses visual facts)
  • Only 1% of failures are Reasoning-only (model sees correctly but fails logic), limiting the effectiveness of reasoning-focused improvements like Chain-of-Thought
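Attribution percentages like those above are shares of failed puzzles by diagnosed category. A minimal sketch of that tally follows, assuming each failure has already been given a single category label; the function name `failure_shares` and the toy labels are hypothetical and do not reproduce the paper's data.

```python
from collections import Counter


def failure_shares(labels):
    """Return the percentage of failures that fall into each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no failures to attribute
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy input for illustration only, not results from the paper
toy_labels = ["perception-only"] * 3 + ["reasoning-only"]
shares = failure_shares(toy_labels)
print(shares)
```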
Breakthrough Assessment
8/10
Provides a critical diagnostic reality check for VLMs, debunking the assumption that 'reasoning' is the primary bottleneck and rigorously quantifying the perception gap.