Pointmap: A dense 2D map where each pixel contains the 3D coordinates (x, y, z) of the corresponding point in the scene
DUSt3R: Dense and Unconstrained Stereo 3D Reconstruction, a prior method that regresses pointmaps directly from image pairs without requiring camera calibration or known poses
Spatial Memory: An external memory bank storing key-value pairs of geometric and visual features from previous frames to guide future predictions
Structure-from-Motion (SfM): A photogrammetric technique that estimates three-dimensional scene structure and camera poses from sequences of two-dimensional images
Bundle Adjustment (BA): An optimization step in 3D reconstruction that refines 3D coordinates and camera parameters by minimizing reprojection error
ViT: Vision Transformer—a model architecture that processes images as sequences of patches using self-attention mechanisms
SLAM: Simultaneous Localization and Mapping—constructing a map of an unknown environment while keeping track of an agent's location within it
MLP: Multilayer Perceptron—a class of feedforward artificial neural network
XMem: A video object segmentation method that introduced memory consolidation (working vs. long-term memory), which Spann3R adapts
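
To make three of the terms above concrete, here is a minimal NumPy sketch of a pointmap, a reprojection error, and a key-value memory read. It is illustrative only and is not the DUSt3R or Spann3R implementation: those methods regress pointmaps with a network and need no intrinsics, whereas this sketch assumes a known depth map and a pinhole intrinsics matrix `K` purely to build an example. The function names (`depth_to_pointmap`, `reprojection_error`, `memory_read`) are hypothetical, and the memory read is plain scaled dot-product attention used as a stand-in.

```python
import numpy as np

def depth_to_pointmap(depth, K):
    """Back-project an (H, W) depth map into an (H, W, 3) pointmap.

    Assumes a pinhole camera with intrinsics K; points are expressed in
    the camera frame. Each pixel ends up holding its own (x, y, z).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel column / row indices
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.stack([x, y, depth], axis=-1)

def reprojection_error(pointmap, K):
    """Mean pixel distance between each pixel and the projection of its 3D point.

    This is the kind of residual that bundle adjustment minimizes.
    """
    H, W, _ = pointmap.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    proj = pointmap @ K.T                    # apply K to every 3D point
    uv = proj[..., :2] / proj[..., 2:3]      # perspective divide
    return float(np.mean(np.hypot(uv[..., 0] - u, uv[..., 1] - v)))

def memory_read(query, keys, values):
    """Read from a key-value memory with scaled dot-product attention.

    query: (N, d), keys: (M, d), values: (M, c). Returns (N, c).
    """
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Example: a flat wall 2 units in front of a toy 4x3 camera.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
pointmap = depth_to_pointmap(np.full((3, 4), 2.0), K)
```

Here `pointmap[r, c]` holds the 3D point seen at pixel row `r`, column `c`. Because the pointmap was built with the same `K` used for projection, `reprojection_error(pointmap, K)` is zero up to floating-point rounding.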