VLA: Vision-Language-Action model—a foundation model that takes visual and text inputs and directly outputs robot actions.
System 1: In dual-process theory, the fast, intuitive, and unconscious mode of thinking; here, the high-frequency motor control module.
System 2: In dual-process theory, the slow, logical, and deliberate mode of thinking; here, the VLM reasoning about high-level tasks.
Diffusion Policy: A method for generating robot actions by iteratively denoising a sample of random noise, which lets the policy represent multimodal action distributions while still producing precise actions.
Action Chunking: Predicting a sequence of future actions (a chunk) at once rather than just the single next step, used to handle temporal dependencies and latency.
Asynchronous Frequency: Running different parts of the model at different speeds; System 2 updates context slowly, while System 1 generates actions quickly.
Proprioception: The robot's internal sense of its own joint positions and velocities.
SE(3): Special Euclidean group in 3D, the group of rigid-body transformations (rotation plus translation); here, it represents the position (x, y, z) and orientation of the robot end-effector.
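
As a rough illustration of the SE(3) entry, the sketch below stores an end-effector pose as a 4x4 homogeneous transform, a common way to represent elements of SE(3). The function names (make_pose, compose, invert, apply_pose) are hypothetical and chosen only for this example. The rotation is limited to yaw about the z-axis for brevity, which is an assumption of the sketch and not something stated in the source.

```python
import math


def make_pose(x, y, z, yaw):
    """Build a 4x4 homogeneous transform (illustrative helper).

    Assumption: orientation is a single yaw angle (radians) about the z-axis.
    A general SE(3) element allows any 3D rotation.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, x],
        [s, c, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def compose(a, b):
    """Group operation: apply transform b first, then a (matrix product a @ b)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def invert(t):
    """Inverse of [R p; 0 1] is [R^T, -R^T p; 0 1]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    p = [t[i][3] for i in range(3)]
    q = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[0] + [q[0]], rt[1] + [q[1]], rt[2] + [q[2]], [0.0, 0.0, 0.0, 1.0]]


def apply_pose(t, point):
    """Map a 3D point from the pose's local frame into the parent frame."""
    v = [point[0], point[1], point[2], 1.0]
    return tuple(sum(t[i][k] * v[k] for k in range(4)) for i in range(3))


# Example: end-effector at (1, 2, 3), rotated 90 degrees about z.
pose = make_pose(1.0, 2.0, 3.0, math.pi / 2)
tip_in_base = apply_pose(pose, (1.0, 0.0, 0.0))
round_trip = compose(pose, invert(pose))  # close to the identity matrix
```

Composing a pose with its inverse returns the identity up to floating-point error, which is the group property that makes SE(3) convenient for chaining frames such as base, wrist, and gripper.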