VLA: Vision-Language-Action models—AI systems that process visual and textual inputs to generate direct robot control actions
Cerebrum: In this paper, the frozen Large Vision-Language Model (VLM) that provides high-level semantic planning and multimodal priors
Pons Adapter: A trainable module that compresses and translates high-dimensional features from the Cerebrum into compact tokens for the execution head
Cerebellum: The high-frequency control module (ParaCAT) that fuses perceptual inputs and Pons tokens to generate motor actions
ParaCAT: Parallel Categorical Action Transformer—the action head that predicts discrete action steps in parallel
ROI: Region of Interest—a specific cropped area of an image, here geometrically tied to the robot's end-effector
Hysteresis: A control strategy where the output state depends on history to prevent rapid switching (jitter) between values
EMA: Exponential Moving Average, a smoothing technique that weights recent observations more heavily, with the weight on older observations decaying exponentially
Micro-horizon reuse: A strategy where a sequence of predicted actions (chunk) is executed sequentially without running the full model for every single step
LIBERO: A benchmark suite for evaluating lifelong robot learning and manipulation policies
SR_cn: Compute-normalized Success Rate—a metric proposed by the authors to evaluate success relative to computational cost
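
As a rough illustration of two of the generic terms above (EMA and Hysteresis), the following is a minimal sketch of the textbook forms of each technique. It is not the paper's implementation: the function names, the smoothing factor `alpha`, and the `low`/`high` thresholds are all assumptions chosen for illustration.

```python
def ema(values, alpha=0.5):
    """Textbook EMA: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    `alpha` is an illustrative smoothing factor, not a value from the paper.
    The first observation seeds the running state.
    """
    smoothed = []
    state = None
    for x in values:
        state = x if state is None else alpha * x + (1 - alpha) * state
        smoothed.append(state)
    return smoothed


def hysteresis(values, low=0.4, high=0.6, initial=False):
    """Two-threshold switch (illustrative thresholds, not from the paper).

    The output turns on only when the input rises above `high`, turns off
    only when it falls below `low`, and otherwise holds its previous state,
    so inputs hovering between the thresholds cannot cause jitter.
    """
    state = initial
    states = []
    for x in values:
        if x > high:
            state = True
        elif x < low:
            state = False
        states.append(state)
    return states


if __name__ == "__main__":
    print(ema([0.0, 1.0, 1.0]))
    print(hysteresis([0.5, 0.7, 0.5, 0.3, 0.5]))
```

In the second call, the input value 0.5 appears three times but maps to different outputs depending on what came before it, which is the history dependence the Hysteresis entry describes.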