VLA: Vision-Language-Action models—models that take vision and language as input and directly output robot actions
VLM: Vision-Language Model—a large transformer model trained on text and images to generate text (or tokens)
2D Path: A sequence of waypoints [(x, y, gripper), ...] on the image plane representing the desired end-effector trajectory, where (x, y) are normalized image coordinates and gripper is the gripper state at that waypoint
Proprioception: The robot's internal sense of its own joint positions and gripper state
Off-domain data: Data collected from sources different from the test environment, such as simulation, videos of humans, or different robot bodies
Sim-to-real: The challenge of transferring policies learned in physics simulation to the real physical world despite differences in visuals and physics
RVT-2: A 3D-aware robot policy architecture (Robotic View Transformer) that uses multi-view 3D representations
Ramer-Douglas-Peucker: An algorithm that simplifies a curve composed of line segments into a similar curve with fewer points by discarding points that lie within a chosen distance tolerance of the simplified curve
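
To make the Ramer-Douglas-Peucker entry concrete, here is a minimal Python sketch applied to a 2D path of normalized (x, y) points. It is an illustration under stated assumptions, not the implementation used in the work this glossary accompanies: the function names `rdp` and `point_segment_distance`, the tolerance `epsilon`, and the sample path are all hypothetical, the gripper component is left out, and distance is measured to the segment between endpoints (a common variant of the classic perpendicular-distance-to-line formulation).

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b; all are (x, y) tuples."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rdp(points, epsilon):
    """Simplify a polyline, keeping points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the chord first-last.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            max_dist, index = d, i
    if max_dist <= epsilon:
        # Every interior point is within tolerance: keep only the endpoints.
        return [first, last]
    # Otherwise keep the farthest point and simplify both halves recursively.
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right  # drop the duplicated split point


# Hypothetical path in normalized image coordinates.
path = [(0.0, 0.0), (0.1, 0.01), (0.2, 0.0), (0.3, 0.3), (0.4, 0.0)]
simplified = rdp(path, epsilon=0.05)
```

With this sample path and tolerance, the near-collinear point (0.1, 0.01) is dropped while the sharp corner at (0.3, 0.3) is kept. A larger `epsilon` removes more points at the cost of a coarser trajectory.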