Evaluation Setup
LVLMs are probed with yes/no questions about object existence on the MSCOCO, A-OKVQA, and GQA datasets.
Benchmarks:
- MSCOCO (validation set): object hallucination evaluation
- A-OKVQA: object hallucination evaluation (via SEEM)
- GQA: object hallucination evaluation (via SEEM)
Metrics:
- F1 Score
- Accuracy
- Precision
- Recall
- Yes Ratio (percent of 'Yes' answers)
- Statistical methodology: Not explicitly reported in the paper
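The metrics above can all be derived from a list of yes/no answers and their ground-truth labels. The following is a minimal sketch, assuming 'yes' is treated as the positive class and scores are reported as percentages; the function name `yes_no_metrics` and its inputs are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, Sequence


def yes_no_metrics(predictions: Sequence[str], labels: Sequence[str]) -> Dict[str, float]:
    """Score binary yes/no answers, treating 'yes' as the positive class.

    Assumption: every answer has already been normalized to 'yes' or 'no'.
    All values are returned as percentages.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not predictions:
        raise ValueError("at least one answer is required")

    pred_yes = [p.strip().lower() == "yes" for p in predictions]
    gold_yes = [g.strip().lower() == "yes" for g in labels]

    # Confusion-matrix counts.
    tp = sum(p and g for p, g in zip(pred_yes, gold_yes))
    fp = sum(p and not g for p, g in zip(pred_yes, gold_yes))
    fn = sum(not p and g for p, g in zip(pred_yes, gold_yes))
    tn = sum(not p and not g for p, g in zip(pred_yes, gold_yes))

    total = len(pred_yes)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "accuracy": 100 * (tp + tn) / total,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
        "yes_ratio": 100 * sum(pred_yes) / total,
    }
```

On a question set with equal numbers of 'yes' and 'no' ground-truth labels, a model that answers 'Yes' to everything gets 100% recall, 50% precision, and a Yes Ratio of 100%, which is why the Yes Ratio is useful to read alongside F1.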
Key Results
Results on MSCOCO using the POPE pipeline show that InstructBLIP is significantly more robust to hallucination than other models, while models like mPLUG-Owl and LLaVA exhibit extreme overconfidence (Yes Ratio ~99%).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MSCOCO | F1 Score | 68.06 | 89.29 | +21.23 |
| MSCOCO | F1 Score | 66.98 | 78.45 | +11.47 |
| MSCOCO | Yes Ratio (Random) | 50.00 | 95.37 | +45.37 |
| MSCOCO | Std Dev (Prompt Variation) | 3.22 | 0.78 | -2.44 |
| MSCOCO | CHAIR_S | 13.0 | 32.7 | +19.7 |
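For readers unfamiliar with the CHAIR_S row, the sketch below shows how CHAIR scores are commonly defined for caption-based evaluation: CHAIR_I is the share of mentioned objects that are not in the image, and CHAIR_S is the share of captions containing at least one such object. This is an illustration under assumptions, not the paper's implementation: the function name `chair_scores` is hypothetical, and the step that extracts object mentions from captions is assumed to have been done already.

```python
from typing import Iterable, Set, Tuple


def chair_scores(captions: Iterable[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Return (CHAIR_I, CHAIR_S) as percentages.

    Each item pairs the objects mentioned in one caption with the objects
    actually present in the corresponding image (both assumed pre-extracted).
    """
    mentioned_total = 0
    hallucinated_total = 0
    caption_total = 0
    captions_with_hallucination = 0

    for mentioned, present in captions:
        hallucinated = mentioned - present  # mentioned but not in the image
        mentioned_total += len(mentioned)
        hallucinated_total += len(hallucinated)
        caption_total += 1
        captions_with_hallucination += bool(hallucinated)

    chair_i = 100 * hallucinated_total / mentioned_total if mentioned_total else 0.0
    chair_s = 100 * captions_with_hallucination / caption_total if caption_total else 0.0
    return chair_i, chair_s
```

For example, with one caption that mentions a dog and a frisbee when only the dog is present, and a second caption with no errors, one of three mentioned objects and one of two captions are hallucinated.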
Main Takeaways
- Most LVLMs (except InstructBLIP) suffer from severe object hallucination, often defaulting to 'Yes' for any object query.
- Hallucinations are not random; they are strongly biased toward objects that appear frequently in instruction tuning data or co-occur with present objects.
- Visual instruction tuning appears to exacerbate hallucination compared to smaller pre-trained models (VLPMs), possibly due to hallucinations inherent in the synthetic instruction data used for training.
- POPE offers a more stable and scalable evaluation method than CHAIR, especially when combined with automatic segmentation tools like SEEM.
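The second takeaway suggests that the hardest probes ask about absent objects that are either frequent in the data or often co-occur with objects in the image. The sketch below shows one way such probe objects could be selected from per-image object annotations. It is a hedged illustration only: the function names, the tie-breaking rule, and the toy data layout are assumptions and are not taken from the paper.

```python
from collections import Counter
from typing import List, Set


def frequent_absent_objects(image_objects: Set[str], dataset: List[Set[str]], k: int) -> List[str]:
    """Pick the k most frequent dataset objects that are absent from the image."""
    freq = Counter(obj for objs in dataset for obj in objs)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(freq, key=lambda o: (-freq[o], o))
    return [o for o in ranked if o not in image_objects][:k]


def cooccurring_absent_objects(image_objects: Set[str], dataset: List[Set[str]], k: int) -> List[str]:
    """Pick the k absent objects that most often co-occur with the image's objects."""
    cooc: Counter = Counter()
    for objs in dataset:
        shared = objs & image_objects
        if not shared:
            continue
        # Count one co-occurrence per (present object, absent object) pair.
        for other in objs - image_objects:
            cooc[other] += len(shared)
    ranked = sorted(cooc, key=lambda o: (-cooc[o], o))
    return ranked[:k]


# Toy annotations (hypothetical): one set of object labels per image.
annotations = [
    {"person", "car"},
    {"person", "dog"},
    {"dining table", "fork"},
    {"dining table", "fork", "knife"},
    {"dining table", "knife"},
    {"person", "dining table"},
]
target = {"dining table"}
popular = frequent_absent_objects(target, annotations, k=2)
adversarial = cooccurring_absent_objects(target, annotations, k=2)
```

Each selected object would then be turned into a yes/no existence question whose ground-truth answer is 'no'.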