← Back to Paper List

Video World Models with Long-term Spatial Memory

Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein
Stanford University, The Chinese University of Hong Kong, Shanghai AI Laboratory
arXiv.org (2025)
Memory MM Benchmark

📝 Paper Summary

Video World Models Long-term Video Generation 3D Spatial Memory
This paper introduces a video world model that maintains long-term 3D consistency by anchoring generation in a geometry-grounded spatial memory (point cloud) alongside working and episodic memory streams.
Core Problem
Existing autoregressive video models suffer from limited context windows, causing them to forget previously generated environments and lose spatial consistency when revisiting scenes.
Why it matters:
  • Current methods rely on sliding windows of frames, which leads to drift and hallucinations when the camera returns to a previous location.
  • Generating infinite-length, consistent worlds is critical for simulators in robotics and interactive graphics, but current models lack persistent 3D understanding.
Concrete Example: If a camera moves to the left and then returns to the right, standard models often generate a different building or object than what was originally there because the original frames have fallen out of the context window.
Key Novelty
Geometry-Grounded Long-Term Spatial Memory
  • Maintains a persistent 3D point cloud of the static environment (Spatial Memory) using TSDF fusion to filter out dynamic objects, ensuring the world layout remains consistent.
  • Combines this with Short-Term Working Memory (recent frames) for smooth motion and Episodic Memory (sparse keyframes) to recall specific visual details from the past.
  • Uses a custom conditioning mechanism to render the static point cloud into a guide video, which directs the diffusion model's generation.
Architecture
Architecture Figure Figure 1
Overview of the memory-augmented video generation framework. It illustrates the interaction between the Spatial Memory (point cloud), Working Memory (recent frames), and Episodic Memory (keyframes) feeding into the diffusion model.
Evaluation Highlights
  • Outperforms baselines in 3D consistency and video quality on a custom 90K-sample dataset derived from MiraData.
  • Demonstrates ability to maintain scene consistency during loop closures where camera revisits previous locations, unlike standard autoregressive models.
  • Validates the decoupling of static scene geometry from dynamic object generation through ablation studies.
Breakthrough Assessment
7/10
Strong conceptual contribution by integrating explicit 3D geometry into video diffusion for consistency. The approach is rigorous, but reliance on a custom dataset and lack of standard benchmark comparisons limits broader impact assessment.
×