
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Wang Xiyao, Zhengyuan Yang, Linjie Li, Hongjin Lu, Yuancheng Xu, Lin Lin, L. Kevin, Furong Huang, Lijuan Wang
University of Maryland, College Park, Microsoft
arXiv.org (2024)
MM, RL, Reasoning, Factuality

📝 Paper Summary

Vision-Language Models (VLMs), Inference-time search, Hallucination reduction
VisVM guides Vision-Language Models during inference by predicting the long-term value of generated sentences, enabling search strategies that produce more detailed and less hallucinated captions.
Core Problem
VLMs often hallucinate visual content and produce under-detailed descriptive captions because standard decoding methods (such as greedy decoding) optimize only immediate token likelihood rather than global coherence or alignment with the image.
Why it matters:
  • Hallucinations in VLMs limit their reliability for real-world applications where visual accuracy is critical
  • Scaling training data is expensive and hits diminishing returns; enhancing inference-time computation offers a scalable alternative path to quality
  • Existing reward models for LLMs target domains such as math and code, where outcomes are easy to verify; visual tasks lack an equally straightforward signal for evaluating partial descriptions
Concrete Example: When describing a complex scene, a standard VLM might generate a sentence mentioning an object that isn't there (hallucination) or stop early with a vague summary. VisVM-guided search anticipates that a vague sentence leads to poor future descriptions, steering the model toward a more detailed, accurate path.
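The search behaviour in this example can be sketched as a sentence-level loop: propose several candidate next sentences, score each with a value function, and keep the best one. This is a minimal illustration, not the authors' implementation. The names `value_guided_search`, `toy_propose`, and `toy_value` are hypothetical, and the toy value function simply rewards longer sentences as a stand-in for a learned, image-conditioned value model.

```python
from typing import Callable, List


def value_guided_search(
    propose: Callable[[List[str]], List[str]],
    value: Callable[[List[str], str], float],
    max_sentences: int,
) -> List[str]:
    """Build a caption sentence by sentence, keeping the highest-value candidate."""
    caption: List[str] = []
    for _ in range(max_sentences):
        candidates = propose(caption)
        if not candidates:
            break  # the generator has nothing more to say
        best = max(candidates, key=lambda s: value(caption, s))
        caption.append(best)
    return caption


def toy_propose(caption: List[str]) -> List[str]:
    # Hypothetical stand-in for sampling candidate sentences from a VLM.
    pool = [
        ["A dog is outside.", "A brown dog runs across a grassy park."],
        ["It looks nice.", "A red frisbee hovers above its head."],
    ]
    return pool[len(caption)] if len(caption) < len(pool) else []


def toy_value(caption: List[str], sentence: str) -> float:
    # Hypothetical stand-in for a value model: favours longer sentences.
    # A real value model would also condition on the image.
    return float(len(sentence.split()))


result = value_guided_search(toy_propose, toy_value, max_sentences=3)
print(" ".join(result))
```

With these toy components, the vague candidates ("A dog is outside.", "It looks nice.") lose to the more specific ones at each step, which mirrors the steering effect described above.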
Key Novelty
Vision Value Model (VisVM) for Inference-Time Search
  • Trains a value network using Temporal Difference (TD) learning to predict the long-term quality of a partial caption, rather than just its immediate relevance
  • Uses the VLM's own visual encoder (like CLIP or SigLIP) as a Process Reward Model (PRM) to ground the value signal in visual similarity without needing human labels
  • Creates a self-improving loop where high-quality captions found via search are used to fine-tune the original model
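The first two bullets can be illustrated with a small TD(0) sketch, assuming the per-sentence reward is the cosine similarity between a sentence embedding and the image embedding, and that the value target is the reward plus the discounted value of the next sentence. The embeddings, the discount factor of 0.9, and the helper names (`cosine_similarity`, `td_targets`, `td_loss`) are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def td_targets(
    rewards: Sequence[float], next_values: Sequence[float], gamma: float
) -> List[float]:
    """TD(0) target for each step: r_t + gamma * V(s_{t+1})."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


def td_loss(values: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted values and TD targets."""
    return sum((v - t) ** 2 for v, t in zip(values, targets)) / len(values)


# Placeholder embeddings (assumed): one image, two caption sentences.
image_emb = [1.0, 0.0]
sentence_embs = [[1.0, 0.0], [0.0, 1.0]]

# Reward: similarity of each sentence to the image (stand-in for a CLIP score).
rewards = [cosine_similarity(e, image_emb) for e in sentence_embs]

# Current value predictions per sentence; the value after the last sentence is 0.
values = [1.0, 0.2]
next_values = values[1:] + [0.0]

targets = td_targets(rewards, next_values, gamma=0.9)
loss = td_loss(values, targets)
print(rewards, targets, loss)
```

In an actual training setup the `values` would come from a trainable network and the loss would be minimized by gradient descent; the sketch only shows how the targets and the loss are formed.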
Evaluation Highlights
  • VisVM-guided captions are preferred 74% of the time over greedy decoding baselines in human evaluation
  • +10.8% average improvement across 9 multimodal benchmarks for LLaVA-Next-7B after self-training on VisVM-generated captions
  • +7.3% average improvement for Qwen2-VL-7B after self-training, showing the approach generalizes across model architectures
Breakthrough Assessment
8/10
Successfully transfers the 'inference-time search' paradigm (popularized by OpenAI o1) to vision-language tasks. The self-improvement loop is particularly promising, demonstrating that inference-time compute can substitute for expensive annotation.