
VisionArena: 230K Real World User-VLM Conversations with Preference Labels

Christopher Chou, Lisa Dunlap, Koki Mashita, Krishna Mandal, Trevor Darrell, I. Stoica, Joseph Gonzalez, Wei-Lin Chiang
Stanford University, University of California, Berkeley
Computer Vision and Pattern Recognition (2024)
MM Benchmark RL

📝 Paper Summary

Vision-Language Models (VLMs) · Human Preference Evaluation · Instruction Tuning
VisionArena introduces a large-scale dataset of 230K real-world user-VLM conversations and a live benchmarking platform to capture open-ended human preferences and improve model alignment.
Core Problem
Existing VLM benchmarks focus on static, single-turn tasks with predetermined answers, failing to capture the open-ended, multi-turn, and evolving nature of real-world user interactions.
Why it matters:
  • Static benchmarks (like VQA) provide only a simplified snapshot of capabilities and overlook user intent in real-world scenarios
  • Understanding authentic user interactions is essential for aligning models with human expectations, particularly for complex tasks like creative writing or humor
  • Current automatic benchmarks often correlate poorly with actual human preference in live settings
Concrete Example: In one failure case, a user submits an image of a cat whose back bears a smaller cat-shaped pattern. Current top models miss the visual pun relating the pattern to a 'square root', whereas a human (and ideally an aligned VLM) would immediately grasp both the humor and the visual reasoning behind it.
Key Novelty
VisionArena Platform & Dataset
  • Integrates VLMs into the Chatbot Arena platform, collecting 230K real-world conversations including 'battles' where users vote on anonymous model outputs
  • Introduces VisionArena-Bench, an automatic evaluation pipeline using 500 diverse prompts and VLM-as-a-judge to cheaply approximate live human rankings
  • Demonstrates that fine-tuning on high-quality filtered conversations from the arena significantly boosts performance on downstream benchmarks compared to standard instruction datasets
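As a rough illustration of how pairwise 'battle' votes can be turned into a model ordering, the sketch below computes a plain per-model win rate. It is not the platform's ranking method, which this summary does not describe; the `(model_a, model_b, winner)` record layout, the `'a'`/`'b'`/`'tie'` labels, and the function name `win_rates` are all assumptions made for the example.

```python
from collections import defaultdict


def win_rates(battles):
    """Return each model's win rate from pairwise battle votes.

    `battles` is an iterable of (model_a, model_b, winner) tuples, where
    winner is 'a', 'b', or 'tie'. This record layout is hypothetical.
    A tie counts as half a win for each side.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in battles:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "a":
            wins[model_a] += 1.0
        elif winner == "b":
            wins[model_b] += 1.0
        else:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {model: wins[model] / games[model] for model in games}


# Toy votes with made-up model names.
example_battles = [
    ("model_x", "model_y", "a"),
    ("model_x", "model_z", "tie"),
    ("model_y", "model_z", "b"),
]
example_rates = win_rates(example_battles)
```

A raw win rate ignores opponent strength, so a real leaderboard would more plausibly fit a rating model to the same votes; the sketch only shows the shape of the data that battles produce.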
Evaluation Highlights
  • VisionArena-Bench reaches a Spearman rank correlation of 97.3% with the live Chatbot Arena leaderboard, well above WildVision-Bench (80.2%)
  • Fine-tuning Llama-3.2-11B on VisionArena-Chat yields a 46.5-point improvement on the human preference benchmark WV-Bench compared to fine-tuning on Llava-Instruct-158K
  • Fine-tuning on VisionArena-Chat improves HallusionBench performance by 369.4 points (1437.0 vs 1067.6) compared to Llava-Instruct-158K
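The Spearman figure above measures how closely the model ordering produced by an offline benchmark matches the ordering on the live leaderboard. A minimal sketch of that computation follows, using only the standard library; the helper names `average_ranks` and `spearman` and the toy scores are assumptions for illustration, not data from the paper.

```python
def average_ranks(values):
    """Assign 1-based ranks to values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy scores for five hypothetical models: the two lists agree
# except that the top two models swap places.
benchmark_scores = [1, 2, 3, 4, 5]
leaderboard_scores = [1, 2, 3, 5, 4]
rho = spearman(benchmark_scores, leaderboard_scores)
```

Because only ranks enter the calculation, the two score scales do not need to match; a value near 1 means the offline benchmark orders models almost exactly as live voters do.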
Breakthrough Assessment
9/10
The dataset scale (230K) and the integration of live human preference into VLM benchmarking are a significant step forward. The strong correlation of the offline benchmark with live data makes it a highly practical tool.