
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

Yujie Lu, Dongfu Jiang, Wenhu Chen, W. Wang, Yejin Choi, Bill Yuchen Lin
Allen Institute for AI, University of Washington, University of California, Santa Barbara, University of Waterloo
Neural Information Processing Systems (2024)
MM Benchmark RL

📝 Paper Summary

Vision-Language Model Evaluation, Human Preference Benchmarking
WildVision introduces a crowdsourced arena that evaluates vision-language models through pairwise human voting, and derives from it a static benchmark on which GPT-4o, used as a judge, closely tracks human preferences.
Core Problem
Current VLM benchmarks are too simple or static to capture real-world use cases, and existing metrics often fail to align with human preferences in complex multimodal interactions.
Why it matters:
  • Static benchmarks (such as MMMU or MM-Vet) often saturate quickly or fail to reflect the diverse, messy nature of real-world user queries
  • Reference-based metrics (exact match) do not capture the nuance of helpfulness and instruction-following in open-ended chat
  • There is a gap between automated metrics and human preference when comparing many models at scale
Concrete Example: The paper reports two failure cases: GPT-4V fails to identify a specific character (Astarion from Baldur's Gate 3) because it lacks gaming domain knowledge, while Gemini-Pro-Vision hallucinates details about a blurred, unreadable license plate.
Key Novelty
WildVision-Arena & WildVision-Bench
  • Establishes a 'Chatbot Arena' for vision models where users chat with two anonymous models side-by-side and vote on the winner, generating Elo ratings
  • Creates a static benchmark (WV-Bench) by sampling 500 high-quality interactions from the arena and using GPT-4o as an automated judge to approximate human rankings
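To make the arena mechanics concrete, the following is a minimal sketch of how pairwise votes can be turned into Elo ratings. It is an illustration under stated assumptions, not the paper's implementation: this summary does not say which fitting procedure WildVision uses, and the K-factor, initial rating, model names, and the `update_elo` / `elo_from_battles` helpers are all hypothetical.

```python
from collections import defaultdict


def update_elo(ratings, model_a, model_b, outcome, k=32.0, scale=400.0):
    """Apply one online Elo update in place.

    outcome: 1.0 if model_a wins, 0.0 if model_b wins, 0.5 for a tie.
    k and scale are conventional defaults, assumed here for illustration.
    """
    ra, rb = ratings[model_a], ratings[model_b]
    # Expected score of model_a under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / scale))
    delta = k * (outcome - expected_a)
    # Zero-sum update: whatever model_a gains, model_b loses.
    ratings[model_a] = ra + delta
    ratings[model_b] = rb - delta


def elo_from_battles(battles, initial=1000.0, k=32.0):
    """Replay (model_a, model_b, outcome) votes in order and return ratings."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in battles:
        update_elo(ratings, model_a, model_b, outcome, k=k)
    return dict(ratings)


# Toy votes with made-up model names (not data from the paper).
battles = [
    ("model_x", "model_y", 1.0),  # model_x preferred
    ("model_y", "model_z", 0.5),  # tie
    ("model_x", "model_z", 1.0),  # model_x preferred
]
ratings = elo_from_battles(battles)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Online Elo depends on the order in which votes are replayed. Arena-style leaderboards often address this by bootstrapping over vote orderings or by fitting a Bradley-Terry model instead; whether WildVision does either is not stated here.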
Evaluation Highlights
  • GPT-4o judge on WV-Bench achieves a 0.94 Spearman correlation with human-voted Elo ratings from the live Arena
  • GPT-4o dominates the Arena leaderboard with a 77% win rate against the second-best model (GPT-4V)
  • Agreement between experts and arena users is substantial (72.5% agreement, Cohen's Kappa 0.59), validating crowdsourced data quality
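The two agreement statistics quoted above can be sketched in a few lines. The code below is a dependency-free illustration of Spearman rank correlation (between arena Elo and judge scores) and Cohen's Kappa (between two sets of preference labels). All numbers and labels are toy values chosen for the example, not data from the paper, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Agreement expected if both annotators labeled independently
    # according to their own label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1.0 - expected)


# Toy example: five models, arena Elo vs. judge-based benchmark score.
arena_elo = [1200, 1150, 1100, 1050, 1000]
judge_score = [0.80, 0.70, 0.72, 0.40, 0.30]
rho = spearman(arena_elo, judge_score)

# Toy example: expert vs. arena-user votes on eight battles.
expert_votes = ["A", "A", "B", "B", "tie", "A", "B", "A"]
user_votes = ["A", "A", "B", "A", "tie", "A", "B", "B"]
kappa = cohens_kappa(expert_votes, user_votes)

print(f"Spearman rho = {rho:.2f}, Cohen's kappa = {kappa:.2f}")
```

Kappa is lower than raw agreement because it discounts the matches two annotators would produce by chance, which is why the summary reports both figures.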
Breakthrough Assessment
9/10
Establishes the definitive 'Arena' for multimodal models, mirroring the success of LMSYS Chatbot Arena. The high correlation of the automated benchmark makes it a standard-setting tool for VLM evaluation.