Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Department of Automation, Tsinghua University
arXiv.org (2025)
Memory MM

📝 Paper Summary

Dense 3D Reconstruction, Streaming/Online Perception, Spatial Memory Mechanisms
Point3R enables efficient streaming 3D reconstruction by maintaining an explicit memory of 3D spatial pointers that dynamically aggregates features from incoming frames without global optimization.
Core Problem
Existing methods for multi-view dense reconstruction either require computationally expensive global optimization (reprocessing all frames) or use limited-capacity implicit memories that lose information from earlier frames.
Why it matters:
  • Real-world embodied agents require streaming perception to respond to their surroundings in real time, which global optimization methods cannot support.
  • Implicit memory approaches (like fixed-length tokens) suffer from forgetting and redundancy as the sequence length increases, degrading reconstruction quality over time.
  • Current pairwise methods (like DUSt3R) are inefficient in multi-view settings because they require a separate post-hoc global alignment step.
Concrete Example: When a robot explores a large building, implicit memory methods like CUT3R eventually fill their fixed token buffer, forcing them to discard early room data. Point3R assigns a 3D pointer to every explored location, so the memory grows naturally with the scene size, preventing forgetting of the starting location.
Key Novelty
Explicit Spatial Pointer Memory
  • Instead of a fixed set of abstract latent tokens, the memory consists of '3D pointers': explicit 3D coordinates paired with spatial features, each mapped directly to a physical location in the scene.
  • New observations are fused into this memory based on spatial proximity; if a new point is near an existing pointer, their features are aggregated; if not, a new pointer is created.
  • A 3D hierarchical position embedding (extending RoPE to 3D) injects relative spatial information into the attention mechanism, guiding how current image tokens interact with stored memory pointers.
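The proximity-based fusion rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Pointer` and `PointerMemory` classes, the `radius` merge threshold, the linear nearest-neighbor search, and the running-mean aggregation are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass
class Pointer:
    position: Tuple[float, ...]  # explicit 3D coordinate of the pointer
    feature: List[float]         # spatial feature attached to that location
    count: int = 1               # observations merged so far (assumed bookkeeping)


@dataclass
class PointerMemory:
    radius: float                                  # hypothetical merge threshold
    pointers: List[Pointer] = field(default_factory=list)

    def nearest(self, position: Sequence[float]) -> Tuple[Optional[Pointer], float]:
        """Return the closest stored pointer and its distance, or (None, inf)."""
        best, best_dist = None, math.inf
        for p in self.pointers:
            d = math.dist(p.position, position)
            if d < best_dist:
                best, best_dist = p, d
        return best, best_dist

    def fuse(self, position: Sequence[float], feature: Sequence[float]) -> None:
        """Merge an observation into a nearby pointer, else create a new one."""
        p, d = self.nearest(position)
        if p is not None and d <= self.radius:
            n = p.count
            # Running mean: an assumed aggregation rule, chosen for simplicity.
            p.feature = [(a * n + b) / (n + 1) for a, b in zip(p.feature, feature)]
            p.position = tuple((a * n + b) / (n + 1) for a, b in zip(p.position, position))
            p.count = n + 1
        else:
            self.pointers.append(Pointer(tuple(position), list(feature)))


memory = PointerMemory(radius=0.5)
memory.fuse((0.0, 0.0, 0.0), [1.0, 0.0])  # empty memory: first pointer is created
memory.fuse((0.2, 0.0, 0.0), [0.0, 1.0])  # within radius: merged into the first pointer
memory.fuse((5.0, 0.0, 0.0), [1.0, 1.0])  # far from any pointer: second pointer is created
print(len(memory.pointers), memory.pointers[0].count)  # prints: 2 2
```

Because a pointer is created only for locations that are not yet covered, the number of pointers in this sketch tracks the explored volume rather than the number of frames processed.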
Architecture
Figure 2: The overall pipeline of Point3R, illustrating how image tokens interact with the spatial pointer memory and how the memory is updated.
Evaluation Highlights
  • Achieves state-of-the-art performance on ScanNet reconstruction, effectively handling long sequences where fixed-memory baselines fail.
  • Training takes 15 days on 8 A800 GPUs, which is reported as a low cost compared with similar large-scale reconstruction models.
  • Generalizes across 14 diverse datasets (indoor/outdoor, static/dynamic) without scene-specific optimization.
Breakthrough Assessment
8/10
Significantly improves the efficiency and scalability of neural 3D reconstruction by replacing global attention with a spatially-aware streaming memory, solving the 'forgetting' problem in long sequences.