Evaluation Setup
Image captioning is evaluated on standard benchmarks that cover descriptive detail, compositionality, and hallucination.
Benchmarks:
- DetailCaps (Fine-grained image description)
- COMPOSITIONCAP (Compositional generalization)
- POPE (Object hallucination / object existence)
Metrics:
- Detailedness / Quality scores (likely CLIP-based)
- Hallucination rates
- Compositional correctness
- Statistical methodology: Not explicitly reported in the paper
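The summary does not say how hallucination rates are computed. As a minimal sketch, assuming the POPE-style setup in which the model answers yes/no questions about whether an object is present, the usual scores are accuracy, precision, recall, F1, and the proportion of "yes" answers. The names `PopeScores` and `pope_scores` are hypothetical. The `false_positive_rate` field is only one possible reading of "hallucination rate" (how often the model affirms an absent object), not a definition taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PopeScores:
    accuracy: float
    precision: float
    recall: float
    f1: float
    yes_ratio: float
    # Assumption: one possible reading of "hallucination rate",
    # i.e. the share of absent objects the model claims to see.
    false_positive_rate: float


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def pope_scores(predictions: Sequence[str], labels: Sequence[str]) -> PopeScores:
    """Score yes/no answers against yes/no ground truth ("yes" = object present)."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and equal in length")

    # Normalise each answer to a boolean: True means "yes".
    pairs = [
        (p.strip().lower() == "yes", l.strip().lower() == "yes")
        for p, l in zip(predictions, labels)
    ]
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    tn = sum(1 for p, l in pairs if not p and not l)
    fn = sum(1 for p, l in pairs if not p and l)

    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return PopeScores(
        accuracy=_ratio(tp + tn, len(pairs)),
        precision=precision,
        recall=recall,
        f1=_ratio(2 * precision * recall, precision + recall),
        yes_ratio=_ratio(tp + fp, len(pairs)),
        false_positive_rate=_ratio(fp, fp + tn),
    )


if __name__ == "__main__":
    # Toy answers for illustration only; these are not results from the paper.
    print(pope_scores(["yes", "yes", "no", "no", "yes"],
                      ["yes", "no", "no", "yes", "yes"]))
```

Reporting the yes ratio alongside F1 helps show whether a model scores well simply by answering "yes" to most questions.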
Main Takeaways
- The TDSR framework is reported to substantially improve the fine-grained description capabilities of the base VLMs (LLaVA-1.5, Qwen2.5-VL).
- By adhering to a global plan, the method reduces hallucinations relative to non-planning baselines.
- The lightweight value network and the parallel expansion mechanism reduce computational cost by an order of magnitude compared to standard MCTS implementations.