World Model: A predictive model that simulates how an environment changes in response to agent actions, acting as a learned simulator.
VLM: Vision-Language Model, a model that processes both images and text, used here to propose tasks (e.g., GPT-4).
VLA: Vision-Language-Action model, a model that takes images and text instructions as input and outputs robot actions (e.g., OpenVLA).
SVD: Stable Video Diffusion, a latent diffusion model for generating video, used here as the backbone of the world model.
Sim-to-Real: Transferring a policy learned in simulation (or inside a learned world model) to the physical world.
Hallucination: In this context, a video model's generation of physically impossible events (objects vanishing or warping), typically because the situation is poorly covered by its training data.
Proprioceptive state: The robot's internal sense of its own body position (joint angles, gripper width).