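TSDF: Truncated Signed Distance Function. A method for representing 3D surfaces in a voxel grid, where each voxel stores the signed distance to the nearest surface, clamped (truncated) to a narrow band around it; useful for fusing multiple depth maps into one consistent surface.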
World Model: A generative model that learns to simulate an environment's response to actions (like camera movement), effectively 'imagining' future states.
DiT: Diffusion Transformer—a neural network architecture for diffusion models that uses Transformer blocks instead of the traditional U-Net.
VAE: Variational Autoencoder—a neural network that compresses data (like images) into a smaller latent space for efficient processing.
CogVideoX: A specific open-source video diffusion model architecture used as the backbone for this work.
Working Memory: In this context, the most recent N frames used to ensure immediate temporal continuity and smooth motion.
Episodic Memory: A sparse set of past keyframes stored to help recall specific visual details when revisiting a location.
Spatial Memory: A global 3D representation (point cloud) of the static parts of the scene, updated incrementally.
CUT3R: An online, recurrent 3D reconstruction method, used here to build the static map and update it incrementally as new frames arrive.
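
To make the TSDF entry concrete, here is a minimal sketch of the classic running-average fusion rule, reduced to a single camera ray (a 1D row of voxels). It is an illustration only, not the implementation used in this work; the function names `integrate` and `zero_crossing`, the unit weight per observation, and the sign convention (positive in front of the surface) are all assumptions made for this example.

```python
def integrate(tsdf, weights, voxel_depths, measured_depth, trunc):
    """Fuse one depth measurement into a 1D TSDF along a single camera ray.

    Hypothetical helper: each observation gets weight 1.0 (an assumption).
    """
    for i, z in enumerate(voxel_depths):
        sdf = measured_depth - z  # positive in front of the surface, negative behind
        if sdf < -trunc:
            continue  # far behind the surface: occluded, leave the voxel untouched
        d = min(1.0, sdf / trunc)  # truncate and normalize to [-1, 1]
        w_new = weights[i] + 1.0
        tsdf[i] = (tsdf[i] * weights[i] + d) / w_new  # weighted running average
        weights[i] = w_new
    return tsdf, weights


def zero_crossing(tsdf, weights, voxel_depths):
    """Return the depth where the fused TSDF changes sign, or None if not found."""
    for i in range(len(tsdf) - 1):
        observed = weights[i] > 0 and weights[i + 1] > 0
        if observed and tsdf[i] > 0 >= tsdf[i + 1]:
            t = tsdf[i] / (tsdf[i] - tsdf[i + 1])  # linear interpolation factor
            return voxel_depths[i] + t * (voxel_depths[i + 1] - voxel_depths[i])
    return None


# Five voxels spaced 0.5 apart along the ray, truncation band of 0.5.
voxel_depths = [0.0, 0.5, 1.0, 1.5, 2.0]
tsdf = [0.0] * len(voxel_depths)
weights = [0.0] * len(voxel_depths)

# Two noisy depth readings of the same surface are averaged in the grid.
for depth in (1.0, 1.25):
    integrate(tsdf, weights, voxel_depths, depth, trunc=0.5)

surface = zero_crossing(tsdf, weights, voxel_depths)
```

The surface is never stored explicitly: it is recovered as the zero crossing of the averaged distances, which is why fusing more depth maps tends to smooth out per-frame noise.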