
DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

Siddharth Joshi, Haoli Yin, Rishabh Adiga, Ricardo Monti, Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, et al.
DatologyAI
arXiv (2026)
MM Benchmark Reasoning Factuality

📝 Paper Summary

Vision-Language Model Evaluation, Benchmark Curation, Efficient Evaluation
DatBench systematically improves VLM evaluation by transforming multiple-choice tasks into generative ones, removing questions solvable without visual input, and selecting high-signal examples to reduce compute costs by 13x.
Core Problem
Existing VLM benchmarks are inefficient and unfaithful: they rely on multiple-choice formats that inflate scores via guessing, contain many questions solvable without images, and suffer from labeling errors.
Why it matters:
  • Evaluation consumes up to 20% of total development compute for frontier models like OLMo3
  • 70% of questions in benchmarks like VQA-v2 are 'blindly solvable' using language priors alone, failing to test multimodal reasoning
  • Score inflation from multiple-choice guessing (a gap of up to 30%) obscures genuine capability differences, causing researchers to 'hill-climb on noise'
Concrete Example: On the AI2D benchmark, models achieve 77.56% average accuracy on multiple-choice questions but drop to 40.53% when the options are removed, a gap of roughly 37 points. This suggests that much of the apparent performance comes from guessing or elimination strategies rather than visual understanding.
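One way to limit guessing when options must be kept is circular evaluation: the same question is asked once per rotation of the option list, and it counts as correct only if the model is right every time. The sketch below is a minimal illustration under assumptions, not the paper's implementation. The `predict` callable (which returns the index of the chosen option), the two toy models, and the sample question are all hypothetical.

```python
from typing import Callable, List, Sequence


def rotations(options: Sequence[str]) -> List[List[str]]:
    """Return every cyclic shift of the option list."""
    n = len(options)
    return [list(options[k:]) + list(options[:k]) for k in range(n)]


def circular_correct(
    question: str,
    options: Sequence[str],
    answer: str,
    predict: Callable[[str, Sequence[str]], int],
) -> bool:
    """True only if the model picks the right option under every rotation."""
    for rotated in rotations(options):
        choice = predict(question, rotated)  # index into `rotated`
        if rotated[choice] != answer:
            return False
    return True


# Hypothetical model with pure position bias: always picks the first option.
def always_first(question: str, options: Sequence[str]) -> int:
    return 0


# Hypothetical model that actually knows the answer.
def oracle(question: str, options: Sequence[str]) -> int:
    return list(options).index("stomach")


question = "Which organ is labeled B?"
options = ["heart", "stomach", "lung", "liver"]

print(circular_correct(question, options, "stomach", always_first))  # False
print(circular_correct(question, options, "stomach", oracle))        # True
```

A model that gets one ordering right by position alone fails as soon as the options are rotated, so it no longer earns credit for that question.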
Key Novelty
Data-Centric Evaluation Curation (Transformation, Filtering, Selection)
  • Faithfulness via Transformation: Converts multiple-choice questions (MCQs) into open-ended generative tasks scored by an LLM judge, so that models cannot guess; where the MCQ format is necessary, uses Circular Evaluation, which rotates the options.
  • Discriminative Efficiency: Instead of random sampling, selects evaluation subsets using point-biserial correlation to maximize the separation between strong and weak models, achieving high signal density with fewer samples.
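The selection idea can be sketched as follows: for each item, correlate its 0/1 correctness across a pool of models with those models' overall scores, then keep the items with the highest correlation. This is a rough sketch under assumptions, not the paper's procedure. The rest-score (total minus the item itself), the zero score for items that every model answers identically, and the toy correctness matrix are choices made here for illustration.

```python
import numpy as np


def point_biserial(item: np.ndarray, score: np.ndarray) -> float:
    """Pearson correlation between a 0/1 item vector and a continuous score.

    For a binary variable this equals the point-biserial correlation.
    """
    if item.std() == 0 or score.std() == 0:
        return 0.0  # no variation across models, so no discriminative signal
    return float(np.corrcoef(item, score)[0, 1])


def item_discrimination(correct: np.ndarray) -> np.ndarray:
    """correct: (n_models, n_items) matrix of 0/1 outcomes."""
    totals = correct.sum(axis=1)
    scores = np.empty(correct.shape[1])
    for j in range(correct.shape[1]):
        # Assumption: use the rest-score so an item is not correlated with itself.
        rest = totals - correct[:, j]
        scores[j] = point_biserial(correct[:, j], rest)
    return scores


def select_discriminative(correct: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k items with the highest point-biserial correlation."""
    return np.argsort(-item_discrimination(correct), kind="stable")[:k]


# Toy data: rows are models (weakest first), columns are items.
correct = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
])

print(item_discrimination(correct).round(3))
print(select_discriminative(correct, k=1))  # item 1 ranks first
```

In the toy matrix, items 0 and 4 are answered identically by every model and carry no signal, while item 1 is solved only by the two stronger models and therefore ranks first.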
Evaluation Highlights
  • 13x average speedup (up to 50x) across 9 capabilities compared to full benchmarks while maintaining discriminative power
  • Identified and removed 70%+ of VQA-v2 samples that were solvable by text-only models without visual input
  • Revealed a ~35 point accuracy drop on AI2D for strong models when switching from MCQ to generative evaluation, correcting inflated capability estimates
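The blind-solvability filter mentioned above can be illustrated with a small sketch: each question is posed to text-only models without the image, and a sample is dropped if enough of them answer correctly. The exact criteria used by the authors are not given in this summary, so the threshold `max_blind_correct`, the exact-match scorer, and the toy language-prior model below are all assumptions.

```python
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, str]
TextOnlyModel = Callable[[str], str]


def exact_match(prediction: str, answer: str) -> bool:
    """Assumed scorer: case-insensitive exact match."""
    return prediction.strip().lower() == answer.strip().lower()


def filter_blind_solvable(
    samples: Sequence[Sample],
    text_only_models: Sequence[TextOnlyModel],
    max_blind_correct: int = 0,
) -> List[Sample]:
    """Keep samples that at most `max_blind_correct` text-only models solve."""
    kept = []
    for sample in samples:
        blind_hits = sum(
            exact_match(model(sample["question"]), sample["answer"])
            for model in text_only_models
        )
        if blind_hits <= max_blind_correct:
            kept.append(sample)
    return kept


# Hypothetical text-only model that answers from language priors alone.
def prior_model(question: str) -> str:
    return "yellow" if "banana" in question else "2"


samples = [
    {"question": "What color is the banana?", "answer": "yellow"},
    {"question": "How many people are in the image?", "answer": "3"},
]

kept = filter_blind_solvable(samples, [prior_model])
print([s["question"] for s in kept])  # ['How many people are in the image?']
```

The banana question is answered correctly without the image and is removed. The counting question survives because the prior-based guess is wrong.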
Breakthrough Assessment
9/10
Critically addresses the 'evaluation crisis' in VLMs by exposing massive score inflation and inefficiency. The shift from rank-preservation to discrimination-maximization for subset selection is a methodological advance.