Evaluation Setup
Models are evaluated on multi-modal benchmarks covering perception, reasoning, open-ended conversation, hallucination, and catastrophic forgetting.
Benchmarks:
- MM-Bench: comprehensive multi-modal evaluation (multiple choice)
- MME: perception and cognition evaluation
- LLaVA-Bench: open-ended conversation (human preference proxy)
- POPE: object hallucination evaluation
- CF Benchmarks: catastrophic forgetting (CIFAR-10, CIFAR-100, MNIST, miniImageNet)
Metrics:
- Accuracy
- Score (MME total)
- Relative Score (LLaVA-Bench)
- Statistical methodology: Not explicitly reported in the paper
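As a rough illustration of how accuracy and a relative score could be computed, here is a minimal sketch. The function names and inputs are hypothetical, and the relative-score formula (candidate judge scores as a percentage of reference judge scores) is an assumption about the usual LLaVA-Bench convention, not something stated in this summary.

```python
def accuracy(predictions, labels):
    """Percentage of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def relative_score(candidate_scores, reference_scores):
    """Candidate judge scores as a percentage of reference judge scores.

    Assumed convention: a judge model rates both the candidate answer and
    a reference answer for each question, and the totals are compared.
    """
    total_reference = sum(reference_scores)
    if total_reference == 0:
        raise ValueError("reference scores must not sum to zero")
    return 100.0 * sum(candidate_scores) / total_reference
```

For example, `accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])` counts three matches out of four, and `relative_score([8, 6], [10, 10])` compares a candidate total of 14 against a reference total of 20.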
Key Results
Vision-Flan-Base (Stage 1) achieves SOTA on academic benchmarks but scores low on chat-style alignment. Vision-Flan-Chat (Stage 2) restores alignment with minimal data.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MME | Score | 1531.3 | 1537.8 | +6.5 |
| MM-Bench | Accuracy | 66.7 | 69.8 | +3.1 |
| LLaVA-Bench | Score | 70.7 | 78.3 | +7.6 |
| POPE | Accuracy | 83.6 | 86.1 | +2.5 |
| CF (Average) | Accuracy | 73.3 | 84.0 | +10.7 |
| LLaVA-Bench | Score | 63.9 | 78.3 | +14.4 |
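The Δ column is the "This Paper" value minus the "Baseline" value. The sketch below recomputes it from the figures reported in the table as a quick arithmetic check. The variable names are illustrative only, and LLaVA-Bench appears twice because the table lists it against two different baselines.

```python
# (benchmark, baseline, this_paper) triples copied from the results table.
rows = [
    ("MME", 1531.3, 1537.8),
    ("MM-Bench", 66.7, 69.8),
    ("LLaVA-Bench", 70.7, 78.3),
    ("POPE", 83.6, 86.1),
    ("CF (Average)", 73.3, 84.0),
    ("LLaVA-Bench", 63.9, 78.3),
]

# Round to one decimal place to match the table's precision and
# to absorb floating-point noise from the subtraction.
deltas = [round(ours - baseline, 1) for _, baseline, ours in rows]

for (name, baseline, ours), delta in zip(rows, deltas):
    print(f"{name:<13} {baseline:>7.1f} -> {ours:>7.1f}  ({delta:+.1f})")
```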
Main Takeaways
- VLM performance across benchmarks improves as the number of human-labeled tasks increases.
- GPT-4 synthesized data mainly modulates response format (style) rather than adding fundamental capability; 1,000 instances are sufficient for this alignment.
- Excessive GPT-4 data introduces bias (e.g., saying 'Yes' too often), leading to increased hallucination.
- Visual instruction tuning primarily updates the LLM to understand visual features; the MLP connector's weights are largely established during pre-training.
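The 'Yes' bias mentioned in the third takeaway can be made concrete by measuring how often a model answers "yes" on binary questions such as those used for object hallucination evaluation. The following is a minimal sketch with a hypothetical function name and made-up toy answers (not data from the paper). It assumes each model answer begins with "yes" or "no".

```python
def yes_ratio_and_accuracy(answers, labels):
    """Return (fraction of 'yes' answers, fraction of correct answers).

    Assumes free-text answers that start with 'yes' or 'no' and
    ground-truth labels that are exactly 'yes' or 'no'.
    """
    if len(answers) != len(labels) or not labels:
        raise ValueError("answers and labels must be non-empty and equal in length")
    # Normalize each answer to a binary 'yes' / 'no' decision.
    decisions = [
        "yes" if a.strip().lower().startswith("yes") else "no" for a in answers
    ]
    yes_ratio = decisions.count("yes") / len(decisions)
    correct = sum(d == y for d, y in zip(decisions, labels))
    return yes_ratio, correct / len(labels)


# Toy example: the model says "yes" three times out of four.
toy_answers = ["Yes", "yes, there is a dog.", "Yes", "No"]
toy_labels = ["yes", "no", "yes", "no"]
ratio, acc = yes_ratio_and_accuracy(toy_answers, toy_labels)
```

If the ground-truth labels are balanced between "yes" and "no", a yes-ratio well above 0.5 suggests the model is over-answering "yes", which is one way the bias described above would show up as hallucinated objects.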