MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Yifan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tien-Ping Tan
Meta AI
arXiv.org (2025)
MM Benchmark QA

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Visual Question Answering (VQA), Benchmarks
MME-RealWorld is a large-scale, fully human-annotated multimodal benchmark focusing on high-resolution, real-world scenarios where current state-of-the-art models fail to reach 60% accuracy.
Core Problem
Existing MLLM benchmarks suffer from small data scales, low-quality model-based annotations, low image resolution, and insufficient difficulty, failing to reflect real-world challenges.
Why it matters:
  • Benchmarks on which top models already score 80-90% are saturated, making it hard to distinguish improvements between advanced models
  • Model-generated annotations introduce noise and cap label quality at the annotator model's own accuracy (e.g., the best annotator models achieve only 50% accuracy)
  • Low-resolution images in existing sets miss critical details needed for real-world tasks like remote sensing or reading complex charts
Concrete Example: In video monitoring, a model must count exactly 133 vehicles; in remote sensing, it must identify small objects on a map whose resolution exceeds 5000x5000. Current models often perform close to random guessing on such tasks.
Key Novelty
Large-scale High-Resolution Human-Annotated Benchmark
  • Collects >13K high-resolution images (avg 2000x1500) from real-world domains like autonomous driving and finance, at substantially higher resolution than prior benchmarks
  • Uses a fully manual annotation pipeline with cross-checking by experts to ensure 100% human accuracy, avoiding the errors inherent in model-generated labels
  • Designed specifically for 'hard-for-human' difficulty, including a choice (Option E) that requires rejecting the offered answers, to test robustness
Evaluation Highlights
  • State-of-the-art models (GPT-4o, Gemini 1.5 Pro) fail to surpass 60% accuracy, highlighting a massive gap between current capabilities and real-world needs
  • Baseline LLaVA-1.5-7B achieves only 24.9% accuracy, significantly lower than its performance on traditional benchmarks
  • Includes a Chinese-specific subset (MME-RealWorld-CN) to avoid translation artifacts common in other benchmarks
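
The accuracy figures above come from multiple-choice scoring. As a minimal sketch of how such a score could be computed, the snippet below assumes five choices (A to E, with E as the rejection option) and a strict answer parser; the option count, the function names, and the parsing rule are illustrative assumptions, not the benchmark's official evaluation code.

```python
import re
from typing import Optional, Sequence

# Assumption: five choices per question, with "E" as the rejection option.
OPTIONS = ("A", "B", "C", "D", "E")

# Strict parser (an assumption): accept only a bare option letter,
# optionally wrapped in parentheses and/or followed by a period.
_CHOICE_PATTERN = re.compile(r"^\(?([A-E])\)?\.?$")


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter in a model response, or None if unparseable."""
    match = _CHOICE_PATTERN.match(response.strip().upper())
    return match.group(1) if match else None


def accuracy(responses: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of responses whose parsed letter equals the reference answer.

    Unparseable responses count as wrong.
    """
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set of questions")
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)


# Expected accuracy of uniform random guessing under the five-option assumption.
CHANCE_LEVEL = 1 / len(OPTIONS)

if __name__ == "__main__":
    demo_responses = ["A", "(B)", "e.", "I think C", "D"]
    demo_answers = ["A", "B", "E", "C", "A"]
    print(f"accuracy: {accuracy(demo_responses, demo_answers):.2f}")
    print(f"chance level: {CHANCE_LEVEL:.2f}")
```

Under the five-option assumption, uniform random guessing scores 20%, which gives a reference point for reading the 24.9% reported for LLaVA-1.5-7B.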
Breakthrough Assessment
9/10
Sets a new standard for difficulty and data quality in MLLM evaluation. The shift to high-resolution, fully human-annotated data exposes the fragility of current SOTA models on tasks that appeared 'solved' on easier benchmarks.