Evaluation Setup
Tasks: descriptive image captioning and general multimodal benchmarks
Benchmarks (those named here; the reported averages span 9 benchmarks):
- MMBench (General VLM capability)
- MMStar (General VLM capability)
- MathVista (Visual Math Reasoning)
- HallusionBench (Hallucination evaluation)
- LLaVA-Bench (General conversation)
Metrics:
- GPT-4o Evaluation (Win rate)
- Human Evaluation (Win rate)
- Standard benchmark scores (Accuracy/Score)
- Statistical methodology: Not explicitly reported in the paper
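
A win rate of the kind listed above can be computed from pairwise judgments. The sketch below is illustrative only: the label names (`"win"`, `"loss"`, `"tie"`), the function `win_rate`, and the convention of counting a tie as half a win are assumptions, not details taken from the paper. The judgment list is made-up example data.

```python
from collections import Counter

VALID_LABELS = {"win", "loss", "tie"}


def win_rate(judgments, tie_weight=0.5):
    """Return the candidate's win rate in percent over pairwise judgments.

    Each judgment is the judge's verdict for the candidate caption against
    the baseline caption. Counting a tie as half a win is an assumption.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    score = counts["win"] + tie_weight * counts["tie"]
    return 100.0 * score / len(judgments)


# Made-up verdicts for ten image pairs (not results from the paper).
example = ["win"] * 6 + ["loss"] * 2 + ["tie"] * 2
example_rate = win_rate(example)
print(example_rate)
```

Under this convention a system compared against itself scores 50, which matches the parity point used as the baseline for the win-rate row.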
Key Results
Inference-time search with VisVM significantly improves caption quality over baselines according to GPT-4o and human judges.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Descriptive Captioning | Win Rate vs Greedy | 50.0 | 74.0 | +24.0 |

Self-training with VisVM-generated captions leads to consistent improvements across standard VLM benchmarks.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Average across 9 benchmarks | Average Score | 63.2 | 70.0 | +6.8 |
| Average across 9 benchmarks | Average Score | 66.8 | 71.7 | +4.9 |
| HallusionBench | Score | 39.6 | 44.7 | +5.1 |
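
The Δ column is the difference between the "This Paper" and "Baseline" columns. A quick consistency check over the reported rows might look like the following; the row labels "(first row)" and "(second row)" are placeholders introduced here to tell the two nine-benchmark averages apart, since the summary does not name the configurations behind them.

```python
# (benchmark, baseline, this_paper, reported_delta), copied from the results table.
rows = [
    ("Descriptive Captioning", 50.0, 74.0, 24.0),
    ("Average across 9 benchmarks (first row)", 63.2, 70.0, 6.8),
    ("Average across 9 benchmarks (second row)", 66.8, 71.7, 4.9),
    ("HallusionBench", 39.6, 44.7, 5.1),
]


def delta(baseline, ours):
    """Absolute improvement, rounded to the table's one-decimal precision."""
    return round(ours - baseline, 1)


# Collect any row whose reported delta disagrees with the recomputed one.
mismatches = [name for name, base, ours, rep in rows if delta(base, ours) != rep]
print(mismatches)
```

Rounding to one decimal before comparing avoids false mismatches caused by floating-point representation of values such as 6.8.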
Main Takeaways
- VisVM-guided search outperforms both greedy decoding and CLIP-PRM search, which supports the value of 'lookahead' value estimation over scoring by immediate reward alone.
- The improvements transfer effectively to smaller models via self-training: training on VisVM-generated captions raises the nine-benchmark average from 63.2 to 70.0 (+6.8) and from 66.8 to 71.7 (+4.9) in the two reported settings.
- Improvements are consistent across different base architectures (LLaVA-Next, Qwen2-VL), suggesting the method is not tied to a single architecture.
- The method specifically reduces hallucinations, as evidenced by gains on HallusionBench (39.6 to 44.7, +5.1) and by qualitative human evaluation.
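
The contrast between immediate reward and lookahead value in the first takeaway can be illustrated with a toy step-level search. This is a minimal sketch, not the paper's implementation: the tree, the reward numbers, the discount `gamma`, and the names `TREE`, `REWARD`, `value`, and `search` are all hypothetical. In particular, the value here is computed exactly by recursing over a tiny tree, whereas a value model would predict it.

```python
# Hypothetical candidate tree: each key maps to the candidate next steps.
TREE = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1", "B2"],
}

# Hypothetical immediate reward for each candidate step.
REWARD = {"A": 0.9, "B": 0.6, "A1": 0.1, "A2": 0.2, "B1": 0.9, "B2": 0.8}


def value(node, gamma=0.9):
    """Immediate reward plus the discounted best achievable future value."""
    children = TREE.get(node, [])
    future = max((value(child, gamma) for child in children), default=0.0)
    return REWARD[node] + gamma * future


def search(score, start="root"):
    """At each step, keep the candidate with the highest score."""
    path, node = [], start
    while TREE.get(node):
        node = max(TREE[node], key=score)
        path.append(node)
    return path


def total_reward(path):
    return sum(REWARD[step] for step in path)


reward_path = search(lambda step: REWARD[step])  # scores by immediate reward
value_path = search(value)                       # scores by lookahead value
print(reward_path, value_path)
```

In this toy tree, scoring by immediate reward commits to step "A" because it looks best locally, while scoring by value picks "B" because its continuations are better, and the value-guided path collects the higher total reward.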