Evaluation Setup
Zero-shot evaluation on multiple-choice questions
Benchmarks:
- MME-RealWorld (Real-world visual perception and reasoning) [New]
- MME-RealWorld-CN (Chinese-native visual perception) [New]
Metrics:
- Accuracy (Avg)
- Class-based Average Accuracy (Avg-C)
- Statistical methodology: Not explicitly reported in the paper
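The two accuracy metrics above can diverge when task classes differ in size. The following is a minimal sketch, assuming that Avg is the fraction of all questions answered correctly and that Avg-C is the unweighted mean of per-class accuracies. The summary does not spell out either definition. The function name `accuracy_metrics` and the `(task_class, predicted, gold)` record layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_metrics(records):
    """Return (avg, avg_c) for an iterable of (task_class, predicted, gold).

    Assumption: avg is question-weighted accuracy; avg_c is the unweighted
    mean of per-class accuracies. Both definitions are illustrative.
    """
    correct_by_class = defaultdict(int)
    total_by_class = defaultdict(int)

    for task_class, predicted, gold in records:
        total_by_class[task_class] += 1
        if predicted == gold:
            correct_by_class[task_class] += 1

    total = sum(total_by_class.values())
    if total == 0:
        raise ValueError("no records to score")

    # Question-weighted accuracy: every question counts equally.
    avg = sum(correct_by_class.values()) / total

    # Class-based average: every class counts equally, regardless of size.
    per_class = [
        correct_by_class[c] / total_by_class[c] for c in total_by_class
    ]
    avg_c = sum(per_class) / len(per_class)

    return avg, avg_c
```

With four questions in one class (three correct) and one question in another (incorrect), this sketch yields an Avg of 0.6 but an Avg-C of 0.375, which shows why a small, hard class pulls Avg-C down more than Avg.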
Key Results
Overall performance of state-of-the-art models on MME-RealWorld shows that no model reaches 60% accuracy, indicating extreme difficulty.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MME-RealWorld | Accuracy (%) | 24.9 | 59.0 | +34.1 |
Main Takeaways
- Even the most advanced models (GPT-4o, Gemini 1.5 Pro) fail to surpass 60% accuracy, well below the 80-90% seen on traditional benchmarks.
- High resolution is critical: tasks such as counting vehicles or reading small text in remote sensing images are major failure points for current MLLMs.
- There is a massive gap between model performance and human capability in complex real-world scenarios like autonomous driving and video surveillance.