VLA: Vision-Language-Action model—a multimodal AI that takes vision and text inputs to generate robotic actions.
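GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that samples a group of outputs for the same input and scores each one against the group's mean reward (scaled by the group's standard deviation) to obtain its advantage. This avoids training a separate value model and reduces variance; it is often used for reasoning tasks.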
SFT: Supervised Fine-Tuning—training a model on labeled examples (here, synthetic CoT data) to initialize behavior before RL.
CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer.
Proprioceptive states: Data regarding the robot's own internal status, such as joint angles or gripper position.
ZOOM-IN: The specific visual tool used in this paper, allowing the model to inspect fine-grained details of a specific image region.
Action Chunking: Predicting a sequence (chunk) of actions at once rather than a single step, used to improve temporal consistency.
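
To make the GRPO entry above concrete, here is a minimal sketch of the group-relative advantage computation. It is an illustration under stated assumptions, not the paper's implementation: the function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are all hypothetical choices made here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampled group.

    `rewards` holds one scalar reward per trajectory sampled for the
    same input. `eps` is an assumed stabilizer that avoids division by
    zero when every trajectory in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four trajectories sampled for one task; two succeeded (reward 1.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Successful trajectories receive positive advantages and failed ones negative, and the values sum to approximately zero within the group. If every trajectory gets the same reward, all advantages are zero, so that group contributes no learning signal.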