VLA: Vision-Language-Action model, a robot policy built directly on a pre-trained Vision-Language Model (VLM) backbone
Cross-embodiment data: Robot datasets collected across different robot types (embodiments) and environments, used to learn generalizable skills
Open X-Embodiment (OXE): A large-scale open-source dataset containing robot manipulation trajectories from many different institutions and robot platforms
CALVIN: A simulation benchmark for long-horizon, language-conditioned robot manipulation tasks
SimplerEnv: A suite of simulated environments for evaluating robot policies trained on real-world data, intended as a proxy for real-robot evaluation (real-to-sim evaluation)
Policy Head: A dedicated neural network module attached to a VLM that maps its high-dimensional features to robot actions, as opposed to generating actions as text tokens
Interleaved Modeling: Feeding historical images and actions into the VLM as a sequence of alternating tokens within the context window
Flow Matching: A generative modeling technique (related to diffusion models) that learns a velocity field transporting noise samples to samples from the target distribution; here it is used to generate actions
DoF: Degrees of Freedom—the number of independent parameters that define the configuration or state of a robot system
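
To make the Flow Matching entry concrete, here is a minimal sketch of the straight-line (linear-path) variant, written with NumPy only. It is an illustration under stated assumptions, not the method of any particular VLA: the 7-dimensional action vector, the names `interpolate`, `target_velocity`, `euler_sample`, and `oracle_velocity` are all hypothetical, and the learned network is replaced by a closed-form velocity field that is exact when the target is a single known action.

```python
import numpy as np


def interpolate(noise, action, t):
    """Point on the straight-line path from noise (t=0) to action (t=1)."""
    return (1.0 - t) * noise + t * action


def target_velocity(noise, action):
    """Regression target a velocity network would be trained to match on that path."""
    return action - noise


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = noise.copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1 inside the loop
        x = x + dt * velocity_fn(x, t)
    return x


# Hypothetical 7-dimensional action (an assumption for illustration only).
expert_action = np.array([0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0])
rng = np.random.default_rng(0)
noise = rng.standard_normal(expert_action.shape)


def oracle_velocity(x, t):
    """Stand-in for a trained network: the exact field when the target is one point."""
    return (expert_action - x) / (1.0 - t)


sampled_action = euler_sample(oracle_velocity, noise, steps=10)
```

In a real policy, `oracle_velocity` would be a network conditioned on VLM features and trained by regressing onto `target_velocity` at points produced by `interpolate`; sampling would then reuse the same Euler loop.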