VLA: Vision-Language-Action model, an AI system that takes visual and language inputs to directly generate physical control actions for a robot
ActionQuery: A learnable query token sequence fed into the VLM to aggregate multimodal information specifically for action generation
Bridge Attention: A proposed attention module that fuses 'Raw' VLM features and 'ActionQuery' features into the policy's action latent space
Raw Features: Direct feature representations extracted from intermediate or final layers of the pre-trained VLM backbone
Proprioception: The robot's internal sense of its own physical state, such as joint angles or gripper position
Prismatic-VLM: A specific VLM architecture used as the backbone, integrating visual encoders (DINOv2, SigLIP) with an LLM
LIBERO: A benchmark suite for evaluating lifelong robot learning, containing task suites that cover spatial arrangement, object manipulation, and long-horizon goals
Action Chunking: Predicting a sequence of future actions (H steps) at once rather than just a single step, used to improve temporal consistency
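To make the Action Chunking entry concrete, here is a minimal sketch of a control loop that queries a policy once per chunk of H actions and executes the chunk before querying again. The names `dummy_policy` and `rollout_with_chunking`, the one-dimensional action, and the toy environment are all hypothetical stand-ins chosen for illustration; they are not taken from the source or from any particular VLA implementation.

```python
from typing import Callable, List, Sequence, Tuple

Action = List[float]
Policy = Callable[[Sequence[float], int], List[Action]]


def dummy_policy(observation: Sequence[float], horizon: int) -> List[Action]:
    """Hypothetical stand-in for a VLA policy head.

    Returns `horizon` future actions in one call. Each action is a single
    float here; a real policy would emit full robot commands.
    """
    return [[observation[0] + 0.1 * (k + 1)] for k in range(horizon)]


def rollout_with_chunking(
    policy: Policy,
    initial_obs: Sequence[float],
    horizon: int,
    total_steps: int,
) -> Tuple[List[Action], int]:
    """Execute `total_steps` actions, querying the policy once per chunk."""
    obs = list(initial_obs)
    executed: List[Action] = []
    policy_calls = 0
    while len(executed) < total_steps:
        # One forward pass predicts the next `horizon` actions.
        chunk = policy(obs, horizon)
        policy_calls += 1
        for action in chunk:
            if len(executed) >= total_steps:
                break
            executed.append(action)
            # Toy environment (assumption): the next observation is
            # simply the action that was just executed.
            obs = list(action)
    return executed, policy_calls


actions, calls = rollout_with_chunking(
    dummy_policy, [0.0], horizon=4, total_steps=10
)
print(len(actions), calls)
```

With H = 4 and 10 total steps, the policy is queried 3 times instead of 10. Setting `horizon=1` recovers ordinary single-step prediction, where every executed action requires its own policy call.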