VLA: Vision-Language-Action model—a multimodal model that takes vision and language as input and outputs robot actions
PVR: Pretrained Visual Representation—visual encoders (like CLIP or R3M) trained on large datasets to extract features for robot policies
World Model: A model that predicts future states of the environment given current states and actions, often used for planning or simulation
CoT: Chain-of-Thought—a reasoning technique where models generate intermediate reasoning steps before the final output
Affordance: The set of actions that are possible for a given object or environment state (e.g., a handle affords pulling)
Sim-to-Real: Transferring policies learned in simulation to physical robots, often requiring domain adaptation
Imitation Learning: Learning a policy by mimicking expert demonstrations rather than by trial-and-error exploration, as in reinforcement learning (RL)
Zero-shot Generalization: The ability to perform tasks or handle objects never seen during training