DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

📝 Paper Summary

Vision-Language Model Evaluation · Benchmark Curation · Efficient Evaluation

DatBench systematically improves VLM evaluation by transforming multiple-choice tasks into generative ones, removing questions solvable without visual input, and selecting high-signal examples, which speeds up evaluation by 13x on average.

Core Problem

Existing VLM benchmarks are inefficient and unfaithful: they rely on multiple-choice formats that inflate scores via guessing, contain many questions solvable without images, and suffer from labeling errors.

Why it matters:

Evaluation consumes up to 20% of total development compute for frontier models like OLMo3
70% of questions in benchmarks like VQA-v2 are 'blindly solvable' using language priors alone, failing to test multimodal reasoning
Inflated scores from multiple-choice guessing (up to 30% gap) obscure genuine capability differences, causing researchers to 'hill-climb on noise'

Concrete Example: On the AI2D benchmark, models achieve 77.56% average accuracy on multiple-choice questions but drop to 40.53% when the options are removed, revealing that much of the perceived performance is due to guessing or elimination strategies rather than visual understanding.
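
As a rough illustration of how much of that gap pure chance could explain, the sketch below applies the standard correction-for-guessing formula to the two AI2D figures quoted above. It assumes four answer options per question, that the generative accuracy approximates the fraction of answers a model actually knows, and that the model guesses uniformly on the rest. All three are assumptions made for illustration, not figures or methods taken from the paper.

```python
def expected_mcq_accuracy(known: float, num_options: int) -> float:
    """Expected MCQ accuracy when a model knows a fraction `known` of the
    answers and guesses uniformly at random on the remainder."""
    return known + (1.0 - known) / num_options


generative_acc = 0.4053  # AI2D accuracy without options (from the text)
observed_mcq = 0.7756    # AI2D accuracy with options (from the text)

# Assumption: 4 options per question, uniform guessing on unknown items.
chance_only = expected_mcq_accuracy(generative_acc, num_options=4)
residual_gap = observed_mcq - chance_only

print(f"MCQ accuracy explained by uniform guessing: {chance_only:.4f}")
print(f"Gap left for elimination or option cues:    {residual_gap:.4f}")
```

Under these assumptions, uniform guessing alone would lift the score to roughly 55%, so a sizeable part of the observed MCQ accuracy would have to come from other option-dependent strategies such as elimination.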

Key Novelty

Data-Centric Evaluation Curation (Transformation, Filtering, Selection)

Faithfulness via Transformation: Converts Multiple Choice Questions (MCQs) into open-ended generative tasks judged by an LLM to prevent guessing, or uses Circular Evaluation to rotate options when MCQs are necessary.
Discriminative Efficiency: Instead of random sampling, selects evaluation subsets using point-biserial correlation to maximize the separation between strong and weak models, achieving high signal density with fewer samples.

Evaluation Highlights

13x average speedup (up to 50x) across 9 capabilities compared to full benchmarks while maintaining discriminative power
Identified and removed 70%+ of VQA-v2 samples that were solvable by text-only models without visual input
Revealed a ~35 point accuracy drop on AI2D for strong models when switching from MCQ to generative evaluation, correcting inflated capability estimates

Breakthrough Assessment

9/10

Critically addresses the 'evaluation crisis' in VLMs by exposing massive score inflation and inefficiency. The shift from rank-preservation to discrimination-maximization for subset selection is a methodological advance.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of general-purpose Vision-Language Models across diverse capabilities (Chart, Document, Scene OCR, Math, etc.)

Inputs: A pool of existing public VLM evaluation datasets (33 datasets spanning 9 capabilities)

Outputs: DatBench (a curated, efficient subset) and DatBench-Full (a cleaned, high-quality full set)

Pipeline Flow

Ingest 33 raw VLM datasets
Group: Transformation Module (MCQ -> Generative)
Group: Filtering Module (Blind & Quality)
Group: Selection Module (Discriminative Subset)
Output DatBench

System Modules

Generative Transformer

Convert MCQs to open-ended tasks to remove guessing bias

Model or implementation: Rule-based processing

Blind Solver Filter (Filtering)

Remove questions solvable without visual input

Model or implementation: Ensemble of 27 VLMs (text-only mode)
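
A minimal sketch of how a text-only ensemble could flag blind-solvable questions is shown below. The response matrix is toy data, and the decision rule (flag an item when at least half of the models answer it correctly without the image) is a hypothetical threshold chosen for illustration; this summary does not state the rule the paper actually uses.

```python
def blind_solvable_items(text_only_correct, threshold=0.5):
    """Return indices of items that text-only models solve too often.

    text_only_correct: list of rows, one per model; each row holds 0/1
    correctness per item when the model is run WITHOUT the image.
    threshold: hypothetical cutoff on the fraction of models that succeed.
    """
    num_models = len(text_only_correct)
    num_items = len(text_only_correct[0])
    flagged = []
    for j in range(num_items):
        blind_rate = sum(row[j] for row in text_only_correct) / num_models
        if blind_rate >= threshold:
            flagged.append(j)
    return flagged


# Toy data: 4 models x 5 items (the paper uses 27 models).
responses = [
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
flagged = blind_solvable_items(responses, threshold=0.5)
kept = [j for j in range(len(responses[0])) if j not in flagged]
print("flagged:", flagged)
print("kept:", kept)
```

Items 0 and 2 are answered correctly by most models without any visual input, so they would be dropped; the remaining items are kept for the later stages.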

Quality Filter (Filtering)

Remove ambiguous, low-res, or mislabeled samples

Model or implementation: GPT-5.2 (VLM Judge)

Discriminative Selector

Select high-efficiency subset

Model or implementation: Statistical selection (Point-biserial correlation)
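
The selection step can be sketched as follows. The point-biserial formula is the standard one; using each model's mean accuracy over all items as the ability score and keeping the top-k items by correlation are assumptions made for illustration, since this summary does not spell out the paper's exact procedure.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_correct, ability):
    """Point-biserial correlation between 0/1 item outcomes and ability.

    r_pb = (M1 - M0) / s * sqrt(p * (1 - p)), where M1 and M0 are the mean
    ability of models that got the item right and wrong, s is the population
    standard deviation of ability, and p is the fraction that got it right.
    """
    p = mean(item_correct)
    s = pstdev(ability)
    if p in (0, 1) or s == 0:
        return 0.0  # undefined: every model agrees, so the item carries no signal
    m1 = mean(a for x, a in zip(item_correct, ability) if x == 1)
    m0 = mean(a for x, a in zip(item_correct, ability) if x == 0)
    return (m1 - m0) / s * sqrt(p * (1 - p))


def select_discriminative(responses, k):
    """Rank items by point-biserial correlation and keep the top k."""
    num_items = len(responses[0])
    # Assumption: overall accuracy serves as the ability proxy for each model.
    ability = [mean(row) for row in responses]
    scores = [
        point_biserial([row[j] for row in responses], ability)
        for j in range(num_items)
    ]
    ranked = sorted(range(num_items), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy response matrix: rows are models (strongest first), columns are items.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
top2, scores = select_discriminative(responses, k=2)
print("selected items:", sorted(top2))
print("scores:", [round(s, 3) for s in scores])
```

In this toy matrix, item 0 is solved by every model and scores zero, while item 4 is solved only by the weaker models and scores negative; neither would be selected. Items with a high positive correlation are the ones that separate strong from weak models.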

Novel Architectural Elements

Use of point-biserial correlation for subset selection instead of rank-preservation optimization
Systematic rejection of blind-solvable samples using a multi-model text-only baseline
Hybrid evaluation pipeline replacing MCQs with Generative or Circular formats depending on task nature
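
For the Circular format, a minimal sketch is given below: the options are rotated through every cyclic shift and the item is credited only if the model is correct each time. The `answer_fn` callback is a hypothetical stand-in for a model call (it receives the option list and returns the chosen index), and the two toy "models" are illustrative only.

```python
def circular_eval(options, correct_index, answer_fn):
    """Credit an MCQ item only if the model is right under every rotation.

    options: list of answer strings in their original order.
    correct_index: index of the correct option in the original order.
    answer_fn: hypothetical model stub; takes the (rotated) option list
               and returns the index of the option it picks.
    """
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # After rotating left by `shift`, the correct option sits here:
        rotated_correct = (correct_index - shift) % n
        if answer_fn(rotated) != rotated_correct:
            return False
    return True


options = ["mitochondria", "nucleus", "ribosome", "vacuole"]

# A position-biased "model" that always picks the first option.
always_first = lambda opts: 0
# A "model" that actually identifies the right answer wherever it appears.
knows_answer = lambda opts: opts.index("mitochondria")

biased_result = circular_eval(options, 0, always_first)
robust_result = circular_eval(options, 0, knows_answer)
print("position-biased model credited:", biased_result)
print("knowledgeable model credited:", robust_result)
```

The position-biased stub would score this item as correct in a single-pass MCQ but fails as soon as the options rotate, which is the kind of shortcut the circular protocol is meant to expose.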

Comparison to Prior Work

vs. VQA-v2/MMMU: DatBench filters out 40-70% of data (blind/noisy) and removes MCQ inflation
vs. Scales++: DatBench uses data-driven statistical discrimination (point-biserial) rather than subjective rubrics
vs. IRT-based methods: DatBench uses robust point-biserial correlation instead of full IRT parameter fitting, which requires larger response matrices
vs. HELM [not cited in paper]: HELM performs holistic evaluation by aggregating many metrics, whereas DatBench focuses on compressing the evaluation set itself for efficiency

Limitations

Relies on a strong proprietary judge (GPT-5.2) for quality filtering, which may not be accessible to all
Discriminative selection depends on the initial set of 27 models; potential overfitting to current model capabilities
Aggressive filtering (up to 42%) might discard some valid but extremely difficult frontier examples

Reproducibility

Code: https://github.com/datologyai/DatBench

The datasets (DatBench and DatBench-Full) are released on HuggingFace, and the code is available on GitHub. The paper specifies both the quality-filtering judge (GPT-5.2) and the scoring judge (Qwen3-30B), and lists the 27 distinct models used for baseline construction.

📊 Experiments & Results

Evaluation Setup

Evaluation of 27 state-of-the-art VLMs (Qwen, InternVL, proprietary models) across 9 capabilities.

Benchmarks:

AI2D (Diagram Understanding)
VQA-v2 (General VQA)
MME-RealWorld (Autonomous Driving / Spatial)
DatBench (Composite Benchmark) [New]

Metrics:

Accuracy (Generative exact match or LLM-judge match)
MCQ Accuracy
Speedup factor

Statistical methodology: Point-biserial correlation for item selection. No explicit significance tests reported for model comparisons.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Format analysis reveals massive score inflation in standard MCQ benchmarks compared to generative evaluation.
AI2D	Average Accuracy (%), MCQ vs. generative	77.56	40.53	-37.03
Data quality analysis shows high rates of invalid or blind-solvable questions in popular benchmarks.
VQA-v2	Blind-Solvable Rate (%)	0	70	+70
MME-RealWorld (Spatial)	Discard Rate (%)	0	42.07	+42.07
Efficiency results demonstrate that the curated DatBench subset allows much faster evaluation.
DatBench Suite	Inference Speedup (×)	1.0	13.0	+12.0

Experiment Figures

Figure 3: Scatter plots comparing MCQ accuracy with generative accuracy (3a) and with Circular Evaluation accuracy (3b) across 27 models.

Figure 7: Curves showing Total Discriminative Power (7a) and Rank Correlation (7b) as a function of dataset size for Random vs. Discriminative sampling.

Main Takeaways

Current MCQ benchmarks inflate VLM capabilities by rewarding guessing and shortcuts; generative evaluation reveals true performance is much lower.
A massive portion of 'visual' benchmarks (up to 70% in VQA-v2) tests only language priors, not visual reasoning.
Selecting evaluation samples based on discriminative power (point-biserial correlation) is far more efficient than random sampling or simple rank preservation.
Inference-time scaling (thinking models) can degrade perceptual performance ('overthinking penalty'), a trend visible only after filtering noisy data.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision-Language Models (VLMs) and their evaluation metrics
Familiarity with Multiple Choice Question (MCQ) vs. Generative evaluation biases
Basic knowledge of Item Response Theory (IRT) concepts like item discrimination

Key Terms

VLM: Vision-Language Model—an AI model capable of processing and reasoning about both images and text

MCQ: Multiple Choice Question, a format where the model selects an answer from a list; scores are prone to inflation from random guessing

blind solvability: The phenomenon where a VLM can answer a visual question correctly using only the text prompt (language priors) without looking at the image

point-biserial correlation: A statistical measure used here to quantify 'item discrimination'—how well a specific question distinguishes between strong and weak models

Circular Evaluation: An evaluation method for MCQs where the options are cyclically permuted (rotated) across multiple passes; the model is only credited if it answers correctly in all permutations

LLM-as-judge: Using a strong Language Model to evaluate the correctness of another model's open-ended text generation

Item Response Theory (IRT): A psychometric paradigm used to design tests by modeling the relationship between a subject's ability and their probability of answering a specific item correctly