Evaluation Setup
Multiple-choice question answering across diverse domains
Benchmarks:
- MMStar (Visual-indispensable multi-modal QA) [New]
- MMBench (General multi-modal QA)
- MMMU (Multi-discipline expert QA)
- ScienceQA (Scientific QA)
- MathVista (Mathematical reasoning)
- AI2D (Diagram understanding)
- SEED (General multi-modal QA)
Metrics:
- Accuracy
- Multi-modal Gain (MG)
- Multi-modal Leakage (ML)
- Statistical methodology: Not explicitly reported in the paper
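The summary names Multi-modal Gain (MG) and Multi-modal Leakage (ML) without defining them. The sketch below assumes the usual reading of these metrics: MG is the accuracy an LVLM gains when it is shown the image, and ML is the amount by which the LVLM's image-free accuracy exceeds that of its base LLM, floored at zero. The function names and the example scores are hypothetical and are not taken from the paper.

```python
def multimodal_gain(score_with_images: float, score_without_images: float) -> float:
    """Assumed definition: accuracy with images minus accuracy without images."""
    return score_with_images - score_without_images


def multimodal_leakage(score_without_images: float, base_llm_score: float) -> float:
    """Assumed definition: how far the LVLM's image-free accuracy exceeds
    its base LLM's accuracy, floored at zero."""
    return max(0.0, score_without_images - base_llm_score)


# Hypothetical scores for illustration only (not results from the paper).
lvlm_with_images = 60.0
lvlm_without_images = 45.0
base_llm = 40.0

mg = multimodal_gain(lvlm_with_images, lvlm_without_images)
ml = multimodal_leakage(lvlm_without_images, base_llm)
print(f"MG = {mg:.1f}, ML = {ml:.1f}")
```

Under these assumed definitions, a high MG with a low ML would indicate that a model's score comes from actually using the image rather than from memorized samples.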
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance on the new MMStar benchmark highlights the difficulty of truly vision-dependent tasks. | | | | |
| MMStar | Accuracy | 41.8 | 57.1 | +15.3 |
| MMStar | Accuracy | 51.4 | 57.1 | +5.7 |
| Investigation of visual independency (answering without images) on existing benchmarks. | | | | |
| ScienceQA | Abnormal Hit Rate | 0 | 57.2 | +57.2 |
| MMMU | Accuracy (Text-only) | 24.8 | 43.6 | +18.8 |
| Data Leakage analysis showing LVLMs memorizing training data. | | | | |
| MMMU | Accuracy (Text-only) | 25.7 | 43.6 | +17.9 |
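As a quick consistency check, the Δ column can be recomputed as the reported value minus the baseline. This is a minimal sketch using only the numbers from the table above; the `rows` layout and the `delta` helper are illustrative names, not part of the paper.

```python
# (benchmark, metric, baseline, reported, listed delta), copied from the table.
rows = [
    ("MMStar", "Accuracy", 41.8, 57.1, 15.3),
    ("MMStar", "Accuracy", 51.4, 57.1, 5.7),
    ("ScienceQA", "Abnormal Hit Rate", 0.0, 57.2, 57.2),
    ("MMMU", "Accuracy (Text-only)", 24.8, 43.6, 18.8),
    ("MMMU", "Accuracy (Text-only)", 25.7, 43.6, 17.9),
]


def delta(baseline: float, reported: float) -> float:
    """Reported minus baseline, rounded to one decimal like the table."""
    return round(reported - baseline, 1)


# Collect any row whose listed delta disagrees with the recomputed one.
mismatches = [
    (name, metric)
    for name, metric, baseline, reported, listed in rows
    if delta(baseline, reported) != listed
]
print("mismatched rows:", mismatches)
```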
Main Takeaways
- Many existing benchmarks (ScienceQA, AI2D) have high rates of questions answerable by text alone, failing to test visual capabilities.
- Significant data leakage exists: LVLMs often outperform their base LLMs on text-only versions of multi-modal tasks, which suggests that they memorized these samples during training.
- MMStar is significantly harder than previous benchmarks; even GPT-4V only achieves 57.1% accuracy.
- Fine-grained perception and logical reasoning remain major challenges for current SOTA LVLMs.