SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding

📝 Paper Summary

3D Visual Grounding (3DVG) · Zero-Shot Learning · Vision-Language Models (VLMs)

SeqVLM performs zero-shot 3D visual grounding by generating 3D proposals, projecting them onto multi-view image sequences to preserve spatial context, and using an iterative VLM reasoning process to identify the target.

Core Problem

Existing zero-shot 3D visual grounding methods rely on single-view renderings or sparse point clouds, leading to spatial misalignment, loss of contextual details, and inability to handle occlusions.

Why it matters:

High annotation costs for 3D bounding boxes limit the scalability and generalization of supervised methods in real-world scenes
Single-view approaches fail to capture multi-object relationships and suffer from geometric inconsistencies between 2D projections and 3D coordinates
Directly using VLMs on raw point clouds is ineffective due to the modality gap and lack of color/texture detail

Concrete Example: Previous VLM-based methods might misalign a 'red chair near the window' because a single rendered view lacks depth or occludes the window. SeqVLM stitches multiple real-world views of the specific proposal into a vertical strip, allowing the VLM to see the chair from different angles alongside its context.

Key Novelty

Proposal-Guided Multi-View Sequence Reasoning

Instead of rendering synthetic views or using single snapshots, SeqVLM projects 3D proposals onto sequences of real-world images, cropping and stitching them to create a 'film strip' for each candidate object.
Introduces an iterative reasoning mechanism where the VLM processes batches of candidate sequences in rounds, filtering out irrelevant candidates step-by-step to avoid context window overload.

Evaluation Highlights

Achieves 55.6% Acc@0.25 on ScanRefer (Zero-Shot), surpassing the previous state-of-the-art (51.6%) by 4.0 percentage points
Achieves 53.2% Acc@0.25 on Nr3D (Zero-Shot), outperforming the best prior zero-shot method (48.0%) by 5.2 percentage points
Performance is competitive with some fully supervised approaches despite using no 3D-text paired training data

Breakthrough Assessment

8/10

Significant performance jump over existing zero-shot baselines by addressing the key limitation of single-view bias. The multi-view sequence approach effectively bridges the gap between 3D geometry and VLM capabilities.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot 3D visual grounding. Localize a target object O* in a 3D scene (point cloud P) given a text description T, without training on scene-text pairs.

Inputs: Colored point cloud P (XYZ + RGB), list of real-world images with camera poses, textual query T

Outputs: 3D bounding box coordinates of the target object O*

Pipeline Flow

Proposal Generation: 3D Segmentation → Semantic Filtering
Visual Representation: Proposal-Guided Projection → Multi-View Stitching
Reasoning: Iterative VLM Inference → Final Bounding Box Selection

System Modules

3D Semantic Segmentation Network (Proposal Generation)

Extract object instance masks and categories from the raw point cloud

Model or implementation: Mask3D (inferred from context as a standard choice in the field, not confirmed here)

Semantic Filter (Proposal Generation)

Filter proposals to keep only those semantically matching the target category derived from text

Model or implementation: Text Encoder (e.g., CLIP-based) + LLM for query parsing
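
The filtering step can be illustrated with a minimal sketch. It assumes each proposal already carries a predicted category label and that the target category has been parsed from the query beforehand. The names `Proposal` and `filter_by_category`, the exact string matching, and the fallback to all proposals are illustrative assumptions; the paper's filter may instead compare text embeddings.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    proposal_id: int
    label: str  # category predicted by the 3D segmentation network


def filter_by_category(proposals, target_label, synonyms=None):
    """Keep proposals whose label matches the target category (case-insensitive).

    Assumption: if nothing matches, fall back to all proposals so that a
    mislabelled target can still be considered by later stages.
    """
    accepted = {target_label.lower()}
    accepted.update(s.lower() for s in (synonyms or ()))
    kept = [p for p in proposals if p.label.lower() in accepted]
    return kept if kept else list(proposals)


proposals = [Proposal(0, "chair"), Proposal(1, "table"), Proposal(2, "Chair")]
chairs = filter_by_category(proposals, "chair")
```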

Proposal-Guided Multi-View Projector (Visual Representation)

Project 3D proposals onto 2D images and crop relevant regions with context

Model or implementation: Geometric projection (Pinhole model)
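
A sketch of the pinhole projection may help make this module concrete. It assumes an axis-aligned 3D box, a 4x4 world-to-camera matrix, and a 3x3 intrinsic matrix `K`. The function names, the `margin` parameter for adding context around the crop, and the rule of skipping views where any corner lies behind the camera are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box as an (8, 3) array."""
    center = np.asarray(center, dtype=float)
    half = np.asarray(size, dtype=float) / 2.0
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return center + signs * half


def project_box(corners_world, world_to_cam, K, image_wh, margin=0.0):
    """Project box corners into an image and return a crop (x0, y0, x1, y1).

    Returns None if any corner is behind the camera or the crop is empty.
    """
    n = corners_world.shape[0]
    homo = np.hstack([corners_world, np.ones((n, 1))])   # (n, 4) homogeneous
    cam = (world_to_cam @ homo.T).T[:, :3]               # camera coordinates
    if np.any(cam[:, 2] <= 1e-6):
        return None                                      # skip this view
    pix = (K @ cam.T).T
    uv = pix[:, :2] / pix[:, 2:3]                        # perspective divide
    w, h = image_wh
    x0, y0 = uv.min(axis=0)
    x1, y1 = uv.max(axis=0)
    pad_x, pad_y = margin * (x1 - x0), margin * (y1 - y0)  # extra context
    x0, y0 = max(0.0, x0 - pad_x), max(0.0, y0 - pad_y)
    x1, y1 = min(float(w), x1 + pad_x), min(float(h), y1 + pad_y)
    if x1 <= x0 or y1 <= y0:
        return None
    return (x0, y0, x1, y1)


K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
crop = project_box(box_corners((0, 0, 2), (1, 1, 1)), np.eye(4), K, (100, 100))
```

Views could then be ranked, for example by projected crop area, to pick the top-k images per proposal; that ranking criterion is also an assumption.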

Sequence Generator (Visual Representation)

Stitch top-k projected images into a vertical strip for each proposal

Model or implementation: Image processing (Crop + Resize + Concat)
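
The crop, resize, and concatenate steps can be sketched with NumPy alone. The tile size, the nearest-neighbour resize, and the function names below are illustrative assumptions; a real pipeline would more likely use an image library with proper interpolation.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array to (out_h, out_w, C)."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]


def stitch_vertical(crops, tile_h=64, tile_w=64):
    """Resize each crop to a common tile size and stack them top to bottom."""
    if not crops:
        raise ValueError("need at least one crop to build a strip")
    tiles = [resize_nearest(c, tile_h, tile_w) for c in crops]
    return np.concatenate(tiles, axis=0)


crops = [np.zeros((10, 20, 3), dtype=np.uint8),
         np.ones((30, 15, 3), dtype=np.uint8)]
strip = stitch_vertical(crops)  # shape (128, 64, 3)
```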

Iterative VLM Reasoner

Select the correct proposal sequence matching the text description via multi-round elimination

Model or implementation: Off-the-shelf VLM (e.g., GPT-4o; the specific model is inferred rather than stated)
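
The multi-round elimination can be sketched as a tournament over batches. Here `ask_vlm` is a hypothetical callable standing in for a real VLM request: it receives one batch of candidates and returns the best match from that batch, or None if none fits. Keeping exactly one survivor per batch is a simplifying assumption; the paper's procedure may keep several candidates per round.

```python
def iterative_select(candidates, ask_vlm, batch_size=4):
    """Reduce candidates round by round until one remains.

    ask_vlm(batch) -> one element of batch, or None (hypothetical interface).
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 to make progress")
    remaining = list(candidates)
    if not remaining:
        return None
    while len(remaining) > 1:
        survivors = []
        for start in range(0, len(remaining), batch_size):
            batch = remaining[start:start + batch_size]
            pick = ask_vlm(batch)           # one VLM call per batch
            if pick is not None:
                survivors.append(pick)
        if not survivors:
            return None                     # every batch was rejected
        remaining = survivors
    return remaining[0]


# Stub "VLM" that simply prefers the largest id, for demonstration only.
calls = []

def stub_vlm(batch):
    calls.append(list(batch))
    return max(batch)

winner = iterative_select(list(range(10)), stub_vlm, batch_size=4)
```

With 10 candidates and a batch size of 4, the stub is called on three batches in the first round and once in the second, so each request only ever contains at most four sequences.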

Novel Architectural Elements

Proposal-guided multi-view projection strategy that creates 'film strips' of static objects from real scene images
Iterative batch-based reasoning loop for VLMs to handle large candidate sets without context overflow

Modeling

Base Model: VLM (e.g., GPT-4o or similar state-of-the-art VLM)

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLM-Grounder: SeqVLM uses visual information (images) rather than only geometric/semantic point cloud data, so it handles texture and color descriptions better
vs. VLM-Grounder: SeqVLM projects 3D proposals to 2D (Proposal-to-Image), ensuring geometric consistency, whereas VLM-Grounder detects in 2D and then projects to 3D (Image-to-Proposal), which causes misalignment
vs. SeeGround: SeqVLM uses multi-view sequences of real images rather than single rendered views, preserving more spatial context and avoiding rendering artifacts

Limitations

Relies on the quality of the upstream 3D semantic segmentation network; poor proposals cannot be recovered
Requires accurate camera poses and dense real-world image coverage of the scene
Iterative VLM reasoning can be slow and computationally expensive compared to single-pass methods
Performance depends heavily on the VLM's ability to interpret vertical image strips and complex spatial prompts

Reproducibility

Code: https://github.com/JiawLin/SeqVLM

The paper relies on existing 3D segmentation backbones and off-the-shelf VLMs/LLMs. The specific VLM version is not explicitly named in the text, but a strong model such as GPT-4V or GPT-4o is implied. Camera poses for the real-world images are required.

📊 Experiments & Results

Evaluation Setup

Zero-shot localization on standard 3D visual grounding datasets

Benchmarks:

ScanRefer (3D Visual Grounding in indoor scenes)
Nr3D (3D Visual Grounding, part of ReferIt3D)

Metrics:

Acc@0.25 (IoU >= 0.25)
Acc@0.5 (IoU >= 0.5)
Statistical methodology: Not explicitly reported in the paper
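
As a worked illustration of the two metrics, the sketch below computes IoU and Acc@k, assuming axis-aligned boxes encoded as (xmin, ymin, zmin, xmax, ymax, zmax). The box encoding and function names are assumptions for this example.

```python
import numpy as np


def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a = np.asarray(box_a, dtype=float)
    b = np.asarray(box_b, dtype=float)
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if boxes are disjoint
    union = np.prod(a[3:] - a[:3]) + np.prod(b[3:] - b[:3]) - inter
    return float(inter / union) if union > 0 else 0.0


def accuracy_at_iou(pred_boxes, gt_boxes, threshold):
    """Fraction of predictions whose IoU with ground truth is >= threshold."""
    hits = [iou_3d_axis_aligned(p, g) >= threshold
            for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits) if hits else 0.0


gt = (0, 0, 0, 1, 1, 1)
preds = [(0, 0, 0, 1, 1, 0.5),     # IoU 0.5 with gt
         (0.5, 0, 0, 1.5, 1, 1)]   # IoU 1/3 with gt
acc_025 = accuracy_at_iou(preds, [gt, gt], 0.25)
acc_050 = accuracy_at_iou(preds, [gt, gt], 0.5)
```

Both predictions count as correct at the 0.25 threshold, but only the first one does at 0.5, which is why Acc@0.5 is the stricter metric.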

Key Results

Zero-shot performance comparisons on ScanRefer and Nr3D benchmarks show SeqVLM outperforming prior state-of-the-art methods.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
ScanRefer	Acc@0.25	51.6	55.6	+4.0
Nr3D	Acc@0.25	48.0	53.2	+5.2

Main Takeaways

State-of-the-art zero-shot performance on both ScanRefer and Nr3D, demonstrating robust generalization without scene-specific training
Proposal-guided projection effectively bridges the domain gap between 3D point clouds and 2D VLM inputs
Multi-view sequences provide critical spatial cues that single-view methods miss, particularly for occlusion handling
Iterative reasoning allows the use of powerful VLMs on large candidate sets by managing token limits dynamically

📚 Prerequisite Knowledge

Prerequisites

3D Semantic Segmentation
Vision-Language Models (VLMs)
Pinhole Camera Model (projection matrices)

Key Terms

3DVG: 3D Visual Grounding—locating objects in 3D space based on natural language descriptions

VLM: Vision-Language Model—AI models capable of understanding and reasoning about both image and text inputs (e.g., GPT-4o, Gemini)

Zero-Shot: The ability of a model to perform a task without having been explicitly trained on examples of that specific task

Acc@0.25: Accuracy metric measuring if the Intersection over Union (IoU) between the predicted and ground truth bounding boxes is at least 0.25

Acc@0.5: Accuracy metric measuring if the Intersection over Union (IoU) is at least 0.5

Open-Vocabulary: The ability to recognize and process object categories that were not present in the training set

Point Cloud: A set of data points in space representing a 3D shape or object

Homogeneous Transformation Matrix: A 4x4 matrix used to describe the rotation and translation between two coordinate systems (e.g., world to camera)

IoU: Intersection over Union—a metric used to evaluate the overlap between two bounding boxes
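
The homogeneous transformation matrix from the key terms can be made concrete with a short sketch. It builds a 4x4 rigid transform from a rotation `R` and translation `t`, and inverts it in closed form, which is useful if poses are stored as camera-to-world while projection needs world-to-camera. The function names are illustrative assumptions.

```python
import numpy as np


def make_transform(R, t):
    """Assemble a 4x4 homogeneous matrix from a 3x3 rotation and 3-vector."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def invert_rigid(T):
    """Invert a rigid transform: rotation becomes R^T, translation -R^T t."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


# 90-degree rotation about the z axis plus a translation.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
cam_to_world = make_transform(R, [1.0, 2.0, 3.0])
world_to_cam = invert_rigid(cam_to_world)

point_cam = np.array([1.0, 0.0, 0.0, 1.0])   # homogeneous point
point_world = cam_to_world @ point_cam
```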