Evaluation Setup
Multimodal Hallucination and General Multimodal Understanding
Benchmarks:
- POPE: Object Hallucination (Yes/No polling)
- GAVIE: Hallucination & Instruction Following (GPT-4 evaluated)
- MMHal-Bench: Open-ended Hallucination Evaluation
- MM-Vet: Complex Multimodal Tasks
- MMBench: Visual Perception & Reasoning
Metrics:
- Accuracy
- F1 Score
- Relevancy
- GPT-4 Score (MMHal-Bench)
- Statistical methodology: Not explicitly reported in the paper
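As an illustration of how Accuracy and F1 Score apply to a yes/no polling benchmark such as POPE, the following is a minimal sketch. It assumes that "yes" is treated as the positive class and that model answers have already been reduced to a bare "yes" or "no"; the function name `yes_no_metrics` is hypothetical and not taken from the paper or any benchmark toolkit.

```python
def yes_no_metrics(predictions, labels):
    """Accuracy, precision, recall and F1 for yes/no answers.

    Assumption: "yes" is the positive class; anything else counts as "no".
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    # Normalise each answer to a boolean: True means "yes".
    pairs = [
        (p.strip().lower() == "yes", l.strip().lower() == "yes")
        for p, l in zip(predictions, labels)
    ]

    tp = sum(1 for p, l in pairs if p and l)          # said yes, object present
    fp = sum(1 for p, l in pairs if p and not l)      # said yes, object absent
    fn = sum(1 for p, l in pairs if not p and l)      # said no, object present
    tn = sum(1 for p, l in pairs if not p and not l)  # said no, object absent

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    preds = ["yes", "no", "yes", "yes"]
    golds = ["yes", "no", "no", "yes"]
    print(yes_no_metrics(preds, golds))
```

In this setting a false positive (answering "yes" for an absent object) corresponds to a hallucinated object, which is why precision and F1 are informative alongside plain accuracy.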
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| Aggregated Hallucination Benchmarks | Performance Enhancement | Not reported in the paper | Not reported in the paper | +24.9% |
| Multimodal Understanding (Math) | Math Score | Not reported in the paper | Not reported in the paper | 2x |
Main Takeaways
- Volcano consistently outperforms LLaVA-1.5 and baseline correctors (LURE, Woodpecker) across hallucination benchmarks, indicating that self-feedback is more effective than external revision models.
- The 'decision' stage (Stage 3) is crucial; skipping it and always accepting the revised response leads to lower performance, suggesting the model is better at judging which of two responses is higher quality than at reliably producing a better response through revision.
- Providing natural language feedback is more effective than scalar rewards (RLHF) or direct revision without feedback (LURE), as it carries fine-grained visual information to guide the correction.
- Performance improves as the number of allowed iterations increases, but each additional iteration adds inference latency.
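The feedback, revision, and decision steps described in these takeaways can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the three callables (`generate_feedback`, `revise`, `prefer_revision`) are hypothetical stand-ins for calls to the same multimodal model, the default of three iterations is an arbitrary placeholder, and stopping as soon as the decision step keeps the current response is an assumption about the control flow.

```python
from typing import Callable


def self_feedback_loop(
    initial_response: str,
    generate_feedback: Callable[[str], str],
    revise: Callable[[str, str], str],
    prefer_revision: Callable[[str, str], bool],
    max_iterations: int = 3,  # assumed budget; more iterations cost more latency
) -> str:
    """Iteratively critique and revise a response, keeping a revision only
    when the decision step prefers it over the current response."""
    response = initial_response
    for _ in range(max_iterations):
        # Natural language critique of the current response.
        feedback = generate_feedback(response)
        # Candidate rewrite conditioned on that critique.
        candidate = revise(response, feedback)
        # Decision step: keep the current response unless the candidate wins.
        if not prefer_revision(response, candidate):
            break  # assumed early stop once a revision is rejected
        response = candidate
    return response


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a model.
    result = self_feedback_loop(
        "ok",
        generate_feedback=lambda r: "add emphasis",
        revise=lambda r, fb: r + "!",
        prefer_revision=lambda old, new: len(new) <= 4,
    )
    print(result)
```

Replacing `prefer_revision` with a function that always returns `True` reproduces the ablation in which every revision is accepted, and `max_iterations` makes the trade-off between quality and latency explicit.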