SAM2: Segment Anything Model 2—a computer vision foundation model designed for segmenting and tracking objects in images and video
RVT: Robotic View Transformer—a baseline architecture that uses multi-view 2D renderings of 3D point clouds to predict robot actions
LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains only small low-rank matrices added alongside them
Markov Assumption: The assumption that the current state contains all the information needed to decide the next action, so earlier history adds nothing once the current state is known
POMDP: Partially Observable Markov Decision Process—a decision-making framework where the agent cannot see the full state of the world and must rely on memory or beliefs
6-DoF: Six Degrees of Freedom, three of translation (x, y, z) and three of rotation (roll, pitch, yaw), which together specify a rigid body's position and orientation in 3D space
Behavior Cloning: A supervised learning approach where the robot learns to mimic expert demonstrations provided in a dataset
MVT: Multi-View Transformer—the core backbone of the RVT architecture that processes images from multiple virtual camera views
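
To make the LoRA entry above more concrete, here is a minimal NumPy sketch of a single low-rank adapted linear layer. It is only an illustration under stated assumptions, not the setup used in the work this glossary accompanies: the layer sizes, the rank, the scaling factor `alpha`, and the function name `lora_forward` are all hypothetical. The pretrained weight `W` stays frozen, and only the two small matrices `A` and `B` would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
d_in, d_out, rank = 64, 32, 4
alpha = 8.0  # hypothetical scaling factor for the low-rank update

W = rng.normal(size=(d_out, d_in))              # pretrained weight, kept frozen
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, zero-initialized


def lora_forward(x):
    """Frozen layer output plus the scaled low-rank correction B @ A @ x."""
    return W @ x + (alpha / rank) * (B @ (A @ x))


x = rng.normal(size=d_in)
y = lora_forward(x)

full_params = W.size            # parameters a full fine-tune would update
lora_params = A.size + B.size   # parameters LoRA actually trains
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and fine-tuning then moves only `A` and `B`. In this toy configuration that is 384 trainable values instead of 2048.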