SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding

Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang, Yuan Xie, Yanyun Qu
School of Informatics, Xiamen University, School of Computer Science, Nanjing University, School of Computer Science and Technology, East China Normal University
arXiv (2025)
MM Reasoning Benchmark

📝 Paper Summary

3D Visual Grounding (3DVG) · Zero-Shot Learning · Vision-Language Models (VLMs)
SeqVLM performs zero-shot 3D visual grounding by generating 3D proposals, projecting them onto multi-view image sequences to preserve spatial context, and using an iterative VLM reasoning process to identify the target.
Core Problem
Existing zero-shot 3D visual grounding methods rely on single-view renderings or sparse point clouds, which leads to spatial misalignment, loss of contextual detail, and an inability to handle occlusions.
Why it matters:
  • High annotation costs for 3D bounding boxes limit the scalability and generalization of supervised methods in real-world scenes
  • Single-view approaches fail to capture multi-object relationships and suffer from geometric inconsistencies between 2D projections and 3D coordinates
  • Directly using VLMs on raw point clouds is ineffective due to the modality gap and lack of color/texture detail
Concrete Example: A previous VLM-based method might mislocalize a 'red chair near the window' because a single rendered view lacks depth cues or occludes the window. SeqVLM stitches multiple real-world views of each proposal into a vertical strip, so the VLM sees the chair from several angles together with its surrounding context.
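The vertical-strip idea in the example can be illustrated with a minimal sketch. This is not the authors' code: the function name `stitch_vertical`, the right-side zero padding, and the absence of any resizing are all assumptions made for illustration, and the crops are assumed to have been extracted from the real-world views already.

```python
import numpy as np

def stitch_vertical(crops, pad_value=0):
    """Stack H_i x W_i x 3 crops top to bottom into one strip.

    Hypothetical helper: narrower crops are padded on the right so that
    every row block shares the width of the widest crop.
    """
    if not crops:
        raise ValueError("need at least one crop")
    width = max(c.shape[1] for c in crops)
    rows = []
    for c in crops:
        pad = width - c.shape[1]
        # Pad only the width axis; height and channels are left untouched.
        rows.append(np.pad(c, ((0, 0), (0, pad), (0, 0)),
                           constant_values=pad_value))
    return np.concatenate(rows, axis=0)

# Two dummy crops of the same proposal seen from different views.
views = [np.ones((40, 60, 3), dtype=np.uint8),
         np.ones((30, 50, 3), dtype=np.uint8)]
strip = stitch_vertical(views)
```

The resulting `strip` is a single image that can be passed to a VLM as one visual input per candidate. A real pipeline might resize crops to a common width instead of padding.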
Key Novelty
Proposal-Guided Multi-View Sequence Reasoning
  • Instead of rendering synthetic views or using single snapshots, SeqVLM projects 3D proposals onto sequences of real-world images, cropping and stitching them to create a 'film strip' for each candidate object.
  • Introduces an iterative reasoning mechanism where the VLM processes batches of candidate sequences in rounds, filtering out irrelevant candidates step-by-step to avoid context window overload.
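The round-based filtering described above can be sketched as a simple elimination loop. The summary does not specify how many candidates survive each batch, so this sketch assumes one survivor per batch (tournament style). `iterative_select`, `choose`, and `batch_size` are hypothetical names, and `choose` stands in for a VLM call that would receive the stitched sequences and the text query.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def iterative_select(candidates: Sequence[T],
                     choose: Callable[[List[T]], T],
                     batch_size: int = 4) -> T:
    """Reduce candidates round by round until a single one remains.

    Assumption: `choose` returns exactly one element of the batch it is
    given. In practice it would wrap a VLM prompt; here it is a stub.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 to make progress")
    pool = list(candidates)
    if not pool:
        raise ValueError("no candidates to select from")
    while len(pool) > 1:
        # Each round shows the VLM at most `batch_size` candidates at a time,
        # which keeps every individual prompt small.
        pool = [choose(pool[i:i + batch_size])
                for i in range(0, len(pool), batch_size)]
    return pool[0]

# Stand-in for the VLM: pick the candidate with the highest mock score.
winner = iterative_select([3, 9, 1, 7, 5, 8, 2], max, batch_size=3)
```

With `batch_size=3`, the seven mock candidates shrink to three after the first round and to one after the second, so no single call ever sees more than three candidates.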
Evaluation Highlights
  • Achieves 55.6% Acc@0.25 on ScanRefer (Zero-Shot), surpassing the previous state-of-the-art by 4.0%
  • Achieves 53.2% Acc@0.25 on Nr3D (Zero-Shot), outperforming prior zero-shot methods by 5.2%
  • Performance is competitive with some fully supervised approaches despite using no 3D-text paired training data
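For readers unfamiliar with the metric, Acc@0.25 is the fraction of queries whose predicted 3D box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.25. The sketch below assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax) and an inclusive threshold. The function names are illustrative and are not taken from any benchmark toolkit.

```python
import numpy as np

def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    lo = np.maximum(a[:3], b[:3])          # lower corner of the overlap
    hi = np.minimum(a[3:], b[3:])          # upper corner of the overlap
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return float(inter / (vol_a + vol_b - inter))

def acc_at(preds, gts, thr=0.25):
    """Fraction of predictions whose IoU with the ground truth is >= thr."""
    hits = [iou_3d(p, g) >= thr for p, g in zip(preds, gts)]
    return sum(hits) / len(hits)

preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
gts   = [(1, 0, 0, 3, 2, 2), (2, 2, 2, 3, 3, 3)]
score = acc_at(preds, gts)  # first pair overlaps (IoU = 1/3), second does not
```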
Breakthrough Assessment
8/10
Significant performance jump over existing zero-shot baselines by addressing the key limitation of single-view bias. The multi-view sequence approach effectively bridges the gap between 3D geometry and VLM capabilities.