VLA: Vision-Language-Action model, a unified neural network that takes vision and language inputs and directly outputs robot actions
ViT: Vision Transformer—a model architecture that processes images as sequences of patches using attention mechanisms
proprioceptive state: Internal sensing of the robot's own body, such as joint angles or gripper position
RTC: Real-Time Chunking—an inference strategy where the robot predicts a chunk of future actions while simultaneously executing the previous chunk to maintain smooth motion
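spatial-temporal attention: An attention mechanism that factorizes attention into a spatial step (among patches within a single frame) and a temporal step (across frames), rather than attending over every patch of every frame jointly, which reduces computation
flow-matching: A generative modeling technique that learns a vector field transporting samples from a simple noise distribution to the data distribution; well suited to continuous outputs such as robot actions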
LLM: Large Language Model, a neural network trained on large text corpora to interpret and generate text
VLM: Vision-Language Model—an AI model trained on both images and text
token: The basic unit of data processed by a Transformer (e.g., a word part or an image patch)
inference latency: The time delay between receiving an input and generating a response
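
To make the "save computation" claim in the spatial-temporal attention entry concrete, the following is a minimal sketch that counts query-key pairs for joint versus factorized attention, treating image patches as tokens as in the ViT entry. The frame size (224x224), patch size (16), frame count (8), and all function names are illustrative assumptions, not values taken from any particular model.

```python
def patch_tokens(height, width, patch):
    """Number of non-overlapping patch tokens in one frame (ViT-style)."""
    return (height // patch) * (width // patch)


def joint_attention_pairs(frames, tokens_per_frame):
    """Query-key pairs when every token attends to every token in every frame."""
    total = frames * tokens_per_frame
    return total * total


def factorized_attention_pairs(frames, tokens_per_frame):
    """Query-key pairs when spatial and temporal attention run separately."""
    # Spatial step: tokens attend only within their own frame.
    spatial = frames * tokens_per_frame ** 2
    # Temporal step: each patch position attends across frames.
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal


# Assumed example values: 8 frames of 224x224 pixels, 16x16 patches.
n = patch_tokens(224, 224, 16)          # 14 * 14 = 196 tokens per frame
joint = joint_attention_pairs(8, n)     # (8 * 196)^2 = 2,458,624
fact = factorized_attention_pairs(8, n) # 8 * 196^2 + 196 * 8^2 = 319,872
print(n, joint, fact, round(joint / fact, 2))
```

Under these assumed sizes, the factorized form needs roughly 7.7 times fewer query-key pairs per attention layer. This counts pairs only; it ignores projection and feed-forward costs, so it is a rough indicator of the saving rather than a latency estimate.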