VLM: Vision-Language Model—a model trained on images and text to understand visual content semantically
VLA: Vision-Language-Action model—a VLM fine-tuned to output robot actions alongside or instead of text
Flow Matching: A generative modeling technique related to diffusion that learns a time-dependent vector field whose flow transports samples from a simple noise distribution to a complex data distribution (used here to generate actions)
Action Chunking: Predicting a sequence (chunk) of future actions at once rather than just the single next action, which helps with temporal consistency
Proprioception: The robot's internal sense of its own body position (e.g., joint angles)
Cross-embodiment: Training a single model on data from multiple different types of robots (embodiments) with different physical structures
DoF: Degrees of Freedom—the number of independent parameters that define the robot's configuration
RGB: Red-Green-Blue, the standard three-channel color representation for camera images
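
To make the Flow Matching and Action Chunking entries more concrete, the following is a minimal sketch of how a chunk of actions could be sampled by integrating a vector field from noise toward data. It is a toy illustration, not the method of any particular model: the names `HORIZON`, `ACTION_DIM`, `toy_velocity`, and `sample_action_chunk` are hypothetical, and the vector field is an analytic stand-in (valid when the data distribution is a single known target chunk) rather than a trained network conditioned on images, text, and proprioception.

```python
import numpy as np

HORIZON = 4      # actions per chunk (hypothetical value)
ACTION_DIM = 7   # e.g. a 7-DoF arm (hypothetical value)


def toy_velocity(x, t, target):
    """Analytic stand-in for a learned vector field v(x, t).

    For the straight-line path x_t = (1 - t) * noise + t * target,
    the velocity that carries x_t to the target is (target - x) / (1 - t).
    A real model would replace this with a neural network.
    """
    return (target - x) / (1.0 - t)


def sample_action_chunk(target, num_steps=10, seed=0):
    """Euler-integrate the vector field from t = 0 (noise) to t = 1 (actions)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        x = x + dt * toy_velocity(x, t, target)
    return x


# A made-up "expert" chunk: HORIZON future actions, each of size ACTION_DIM.
target_chunk = np.linspace(0.0, 1.0, HORIZON * ACTION_DIM).reshape(HORIZON, ACTION_DIM)
chunk = sample_action_chunk(target_chunk)
print(chunk.shape)
```

The whole chunk of shape (HORIZON, ACTION_DIM) is produced in one sampling pass, which is the sense in which action chunking predicts several future actions at once instead of only the next one.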