VLA: Vision-Language-Action model—a foundation model trained to output robot actions directly from vision and language inputs
System 1: In cognitive science/AI, the component responsible for fast, intuitive, and automatic execution (acting) without explicit deliberation
System 2: The component responsible for slow, deliberate, and logical processing (reasoning/planning)
Flow Matching: A generative modeling technique that learns a velocity field transporting noise samples to data samples; here it trains the model's 'action head', which outputs continuous robot actions
[BOR]: Beginning of Reasoning—a special decision token indicating the model should generate text reasoning
[BOA]: Beginning of Action—a special decision token indicating the model should generate physical robot actions
Co-training: Training a model simultaneously on multiple datasets (here, robot demonstration data and synthetic vision-language data) so that capabilities learned from one data source transfer to the other
DoF: Degrees of Freedom—the number of independent parameters that define the configuration of a robotic arm (e.g., 7-DoF arm)
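
To make the Flow Matching entry above more concrete, here is a minimal sketch in plain Python. It assumes the common straight-line path between a Gaussian noise sample and a demonstrated action, which this glossary does not specify; the model's actual path, loss weighting, and network may differ. All names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, `predict_velocity`) are hypothetical and chosen for illustration only.

```python
import random

def interpolate(noise, action, t):
    """Point on the straight-line path from noise (t=0) to action (t=1).
    Assumption: linear path; other paths are possible."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Velocity of the straight-line path: constant, equal to action - noise."""
    return [a - n for n, a in zip(noise, action)]

def flow_matching_loss(predict_velocity, action, rng):
    """One-sample estimate of the squared-error flow matching loss.
    predict_velocity(x, t) stands in for the action head (hypothetical)."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, action, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(action)

def euler_sample(predict_velocity, dim, steps, rng):
    """Generate an action by Euler-integrating the velocity field from noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for i in range(steps):
        t = i / steps
        v = predict_velocity(x, t)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x
```

At training time the action head is fit to minimize `flow_matching_loss` over demonstrated actions; at inference time `euler_sample` turns a noise vector into an action, for example a 7-element vector for a 7-DoF arm. A real action head would also condition on the vision and language inputs, which this sketch omits.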