VisionArena: 230K Real World User-VLM Conversations with Preference Labels

📝 Paper Summary

Vision-Language Models (VLMs), Human Preference Evaluation, Instruction Tuning

VisionArena introduces a large-scale dataset of 230K real-world user-VLM conversations and a live benchmarking platform to capture open-ended human preferences and improve model alignment.

Core Problem

Existing VLM benchmarks focus on static, single-turn tasks with predetermined answers, failing to capture the open-ended, multi-turn, and evolving nature of real-world user interactions.

Why it matters:

Static benchmarks (like VQA) provide only a simplified snapshot of capabilities and overlook user intent in real-world scenarios
Understanding authentic user interactions is essential for aligning models with human expectations, particularly for complex tasks like creative writing or humor
Current automatic benchmarks often correlate poorly with actual human preference in live settings

Concrete Example: In one failure case, a user submits an image of a cat with a smaller cat-shaped pattern on its back. Current top models miss the visual pun relating the pattern to a 'square root', whereas a human (and ideally an aligned VLM) would grasp the humor and the visual reasoning immediately.

Key Novelty

VisionArena Platform & Dataset

Integrates VLMs into the Chatbot Arena platform, collecting 230K real-world conversations including 'battles' where users vote on anonymous model outputs
Introduces VisionArena-Bench, an automatic evaluation pipeline using 500 diverse prompts and VLM-as-a-judge to cheaply approximate live human rankings
Demonstrates that fine-tuning on high-quality filtered conversations from the arena significantly boosts performance on downstream benchmarks compared to standard instruction datasets

Evaluation Highlights

VisionArena-Bench achieves 97.3% Spearman correlation with the live Chatbot Arena leaderboard, significantly outperforming WildVision-Bench (80.2%)
Fine-tuning Llama-3.2-11B on VisionArena-Chat yields a 46.5-point improvement on the human preference benchmark WV-Bench compared to fine-tuning on Llava-Instruct-158K
Fine-tuning on VisionArena-Chat improves the HallusionBench score by 369.4 points (1437.0 vs. 1067.6) compared to Llava-Instruct-158K

Breakthrough Assessment

9/10

The dataset scale (230K) and the integration of live human preference into VLM benchmarking are a significant step forward. The strong correlation of the offline benchmark with live data makes it a highly practical tool.

⚙️ Technical Details

Problem Definition

Setting: Crowdsourced evaluation of Vision-Language Models via pairwise comparison and subsequent supervised fine-tuning

Inputs: User prompt p containing text and optional image I

Outputs: Preference label y indicating which model response (A or B) is better, or a tie
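The input/output definition above can be pictured as a small record type. This is a minimal sketch and not the dataset's actual schema: the class names, field names, and label strings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Preference(Enum):
    """Preference label y for one battle (label strings are illustrative)."""
    MODEL_A = "model_a"
    MODEL_B = "model_b"
    TIE = "tie"


@dataclass(frozen=True)
class Battle:
    """One pairwise comparison: prompt p, optional image I, two responses, label y."""
    prompt_text: str
    image_path: Optional[str]  # None when the user sent text only
    model_a: str
    model_b: str
    response_a: str
    response_b: str
    label: Preference

    def winner(self) -> Optional[str]:
        """Return the preferred model's name, or None for a tie."""
        if self.label is Preference.MODEL_A:
            return self.model_a
        if self.label is Preference.MODEL_B:
            return self.model_b
        return None


# Hypothetical example record; model names and texts are placeholders.
example = Battle(
    prompt_text="What is funny about this picture?",
    image_path="cat.png",
    model_a="model-x",
    model_b="model-y",
    response_a="First response text",
    response_b="Second response text",
    label=Preference.MODEL_B,
)
print(example.winner())
```

Keeping the tie as an explicit label, rather than dropping such battles, lets a downstream ranking step decide how to treat them.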

Pipeline Flow

User Interface (Unified Chat)
Routing System (Direct Chat vs. Battle)
Content Moderation (NSFW/PII filtering)
Ranking System (Bradley-Terry Model)

System Modules

Routing System

Directs user inputs to appropriate models

Model or implementation: Heuristic Router

VisionArena-Bench Judge

Approximates human preference by judging pairs of model outputs

Model or implementation: GPT-4o
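A VLM-as-a-judge loop of this kind can be sketched as follows. The summary does not give the judging prompt or the aggregation rule, so everything here is an assumption: the judge is an arbitrary callable (a toy length-based stand-in replaces GPT-4o), each pair is judged in both orders to reduce position bias, and ties count as half a win.

```python
from typing import Callable, Sequence

# (prompt, response_a, response_b) -> "A", "B", or "tie"
Judge = Callable[[str, str, str], str]


def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in for a real VLM judge: simply prefers the longer response."""
    if len(response_a) > len(response_b):
        return "A"
    if len(response_b) > len(response_a):
        return "B"
    return "tie"


def win_rate(
    prompts: Sequence[str],
    candidate: Sequence[str],
    baseline: Sequence[str],
    judge: Judge,
) -> float:
    """Fraction of judgments won by the candidate; ties count as half a win.

    Each pair is judged twice, once with the candidate in slot A and once in
    slot B (an assumed way to reduce position bias).
    """
    score = 0.0
    total = 0
    for prompt, cand, base in zip(prompts, candidate, baseline):
        for cand_first in (True, False):
            if cand_first:
                verdict = judge(prompt, cand, base)
            else:
                verdict = judge(prompt, base, cand)
            cand_slot = "A" if cand_first else "B"
            if verdict == "tie":
                score += 0.5
            elif verdict == cand_slot:
                score += 1.0
            total += 1
    return score / total if total else 0.0


# Hypothetical prompts and responses, for illustration only.
prompts = ["Describe the image.", "Explain the joke."]
candidate = ["A tabby cat sits on a red sofa.", "It is a pun."]
baseline = ["A cat.", "The pattern on the cat looks like a smaller cat."]
print(win_rate(prompts, candidate, baseline, toy_judge))
```

In practice the `judge` callable would wrap a call to the judging model; the aggregation code stays the same.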

Modeling

Base Model: Llama-3.2-11B-Vision

Training Method: Visual Instruction Tuning

Adaptation: Full fine-tuning of multimodal projector and language model (Vision encoder frozen)

Training Data:

100,000 conversations sampled from VisionArena-Chat
Filtered for highest-performing VLMs (e.g., GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet)
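The data selection described above (keep conversations from strong models, then sample a fixed number) could look roughly like this. It is a sketch under assumptions: the `model` field name, the lowercase model identifiers, and the use of uniform random sampling are illustrative and not confirmed by the summary.

```python
import random
from typing import Dict, List, Set

# Illustrative identifiers for the strong models named above.
STRONG_MODELS: Set[str] = {"gpt-4o", "gemini-1.5-pro", "claude-3.5-sonnet"}


def select_training_conversations(
    conversations: List[Dict],
    allowed_models: Set[str],
    sample_size: int,
    seed: int = 0,
) -> List[Dict]:
    """Keep conversations answered by an allowed model, then sample uniformly.

    Each conversation is assumed to be a dict with a "model" key.
    If fewer than `sample_size` conversations survive the filter, all are returned.
    """
    kept = [c for c in conversations if c["model"] in allowed_models]
    if len(kept) <= sample_size:
        return kept
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(kept, sample_size)


# Toy usage with made-up records; the real setting would use sample_size=100_000.
toy = [{"id": i, "model": "gpt-4o"} for i in range(5)]
toy += [{"id": i, "model": "small-vlm"} for i in range(5, 10)]
subset = select_training_conversations(toy, STRONG_MODELS, sample_size=3)
print(len(subset))
```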

Key Hyperparameters:

epochs: 3

Compute: Not reported in the paper

Comparison to Prior Work

vs. WildVision: VisionArena has 3x more data, 45+ models (vs 19), 138 languages, and captures broader topics (coding, homework) beyond specific visual tasks
vs. LLaVA-Instruct: VisionArena uses real-world user prompts rather than synthetic/seed-based questions, leading to better generalization on human-preference benchmarks

Limitations

Dataset distribution is skewed toward STEM, OCR, and 'toy' problems (humor/riddles), underrepresenting medical or geospatial domains
Many of the 138 languages captured do not have enough examples to produce stable per-language leaderboards
Automated PII and NSFW filters are not infallible; some sensitive content may remain

Reproducibility

Code: https://github.com/lm-sys/FastChat

The dataset is publicly available (https://huggingface.co/lmarena-ai) and includes 230K conversations. The authors also release the Llama-3.2-VisionArena model fine-tuned on this data. Code for the platform and evaluation is available at https://github.com/lm-sys/FastChat.

📊 Experiments & Results

Evaluation Setup

Instruction fine-tuning evaluation and benchmark correlation analysis

Benchmarks:

MMMU (Multi-discipline multimodal reasoning)
HallusionBench (Visual hallucination diagnosis)
WV-Bench (WildVision-Bench) (Human preference approximation)
VisionArena-Bench (Human preference approximation) [New]

Metrics:

Accuracy
Spearman Correlation
Kendall Tau Correlation

Statistical methodology: Bootstrap resampling (100 times) to construct confidence intervals for Bradley-Terry ratings
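To make the bootstrap procedure concrete, here is a minimal pure-Python sketch. It is not the platform's implementation: the Bradley-Terry fit uses a simple minorization-maximization update, ties are counted as half a win for each side, ratings are reported as mean-centered log-strengths, and the battle data is invented.

```python
import math
import random
from collections import defaultdict


def fit_bradley_terry(battles, iters=200):
    """Fit Bradley-Terry log-strengths from (model_a, model_b, outcome) tuples.

    outcome is "a", "b", or "tie"; a tie counts as half a win for each side
    (an assumption). Returns {model: log-strength}, centered to mean zero.
    """
    if not battles:
        return {}
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for a, b, outcome in battles:
        models.update((a, b))
        pair_counts[frozenset((a, b))] += 1.0
        if outcome == "a":
            wins[a] += 1.0
        elif outcome == "b":
            wins[b] += 1.0
        else:
            wins[a] += 0.5
            wins[b] += 0.5
    models = sorted(models)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                n = pair_counts.get(frozenset((m, other)), 0.0)
                if other != m and n:
                    denom += n / (strength[m] + strength[other])
            # Clamp so a model with zero wins keeps a finite log-strength.
            updated[m] = max(wins[m] / denom, 1e-9) if denom else strength[m]
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return {m: math.log(v) for m, v in strength.items()}


def bootstrap_intervals(battles, rounds=100, alpha=0.05, seed=0):
    """Percentile confidence intervals from resampling battles with replacement."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resampled = rng.choices(battles, k=len(battles))
        for model, rating in fit_bradley_terry(resampled).items():
            samples[model].append(rating)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        lower = values[int((alpha / 2) * (len(values) - 1))]
        upper = values[int((1 - alpha / 2) * (len(values) - 1))]
        intervals[model] = (lower, upper)
    return intervals


# Invented battles between three hypothetical models.
battles = (
    [("x", "y", "a")] * 30 + [("x", "y", "b")] * 10
    + [("y", "z", "a")] * 25 + [("y", "z", "b")] * 15
    + [("x", "z", "a")] * 35 + [("x", "z", "tie")] * 5
)
ratings = fit_bradley_terry(battles)
intervals = bootstrap_intervals(battles, rounds=100)
for model in sorted(ratings, key=ratings.get, reverse=True):
    print(model, round(ratings[model], 3), intervals[model])
```

Resampling whole battles keeps each vote intact, so the spread across the 100 refits reflects how much the ratings depend on which votes happened to be collected.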

Key Results

Fine-tuning experiments show that training on real-world VisionArena data significantly outperforms training on the synthetic Llava-Instruct dataset across multiple benchmarks (Baseline = fine-tuned on Llava-Instruct-158K).

Benchmark	Metric	Baseline	This Paper	Δ
MMMU	Accuracy	38.7	45.2	+6.5
WV-Bench	Score	10.4	56.9	+46.5
HallusionBench	Score	1067.6	1437.0	+369.4

Benchmark analysis shows that VisionArena-Bench correlates much better with live human voting data than the previous baseline (Baseline = WildVision-Bench; correlations in %).

Benchmark	Metric	Baseline	This Paper	Δ
Chatbot Arena Leaderboard	Spearman Correlation	80.2	97.3	+17.1
Chatbot Arena Leaderboard	Kendall Tau Correlation	69.2	89.7	+20.5
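For readers unfamiliar with the two rank correlations, the following sketch computes both for a pair of model rankings. The scores are made up for illustration and are not the paper's data; the implementations assume no tied scores (Spearman via the rank-difference formula, Kendall as tau-a).

```python
from itertools import combinations
from typing import List, Sequence


def ranks(scores: Sequence[float]) -> List[int]:
    """Rank values from 1 (lowest) upward; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for five models: live arena rating vs. offline benchmark score.
arena_scores = [1250, 1200, 1180, 1100, 1050]
bench_scores = [80, 75, 77, 60, 55]
print(spearman(arena_scores, bench_scores))
print(kendall_tau(arena_scores, bench_scores))
```

Both measures depend only on the ordering of models, which is why they suit comparing an offline benchmark's ranking against a live leaderboard whose score scales differ.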

Experiment Figures

Impact of confounding variables (style) on user preferences across different task categories

Main Takeaways

Response style (e.g., length, markdown formatting) heavily influences human preference, especially in open-ended tasks like captioning and humor
Current VLMs struggle significantly with 'visual puns' and complex spatial reasoning (e.g., finding the 'square root' of a cat), often failing where humans succeed easily
Fine-tuning on diverse, real-world user interactions (VisionArena) yields far better generalization and alignment than training on static/synthetic datasets (Llava-Instruct)

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) and Vision-Language Models (VLMs)
Understanding of Bradley-Terry models for ranking
Knowledge of instruction tuning and RLHF concepts

Key Terms

VLM: Vision-Language Model—an AI model capable of understanding and generating text based on visual inputs

Bradley-Terry Model: A statistical model used to estimate the relative skill or strength of competitors based on the outcomes of paired comparisons
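In its standard form, the model assigns each competitor i a real-valued strength and turns the difference between two strengths into a win probability. The symbol β is notation chosen here, not taken from the paper:

```latex
P(i \text{ beats } j)
  = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}
  = \frac{1}{1 + e^{-(\beta_i - \beta_j)}}
```

Only differences between strengths matter, so the ratings are identified up to an additive constant.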

Chatbot Arena: An open-source platform where users chat with anonymous models side-by-side and vote on which response they prefer

OCR: Optical Character Recognition—the conversion of images of typed, handwritten, or printed text into machine-encoded text

NER: Named Entity Recognition—identifying and classifying key information (entities) in text into predefined categories

CSAM: Child Sexual Abuse Material—harmful content that is automatically filtered out of the dataset

NSFW: Not Safe For Work—content containing nudity, violence, or other sensitive material

PII: Personally Identifiable Information—data that could identify a specific individual

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark

HallusionBench: A benchmark designed to diagnose visual illusions and hallucinations in VLMs