Set-of-Mark (SoM): A visual prompting technique where actionable objects in an image are overlaid with numeric labels or bounding boxes to help the model reference them
Trace-of-Mark (ToM): A temporal extension of SoM where the movement trajectories of marked objects are visualized across video frames, serving as a proxy for action planning
Vision-Language-Action (VLA): Models that integrate visual perception, language understanding, and action generation into a single system
7-DoF: Seven degrees of freedom, describing the movement capabilities of a robot arm (position x, y, z; rotation yaw, pitch, roll; gripper state)
CoTracker: A computer vision model that tracks dense points across video frames; used here to generate ToM labels
ConvNeXt: A convolutional neural network architecture used here as the vision encoder for its ability to handle arbitrary resolutions
SOTA: State-of-the-Art, the best performance currently reported on a specific benchmark
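
As an illustration of the 7-DoF entry above, a single action can be pictured as a flat vector of seven numbers. The sketch below is only one possible layout: the class name `Action7DoF`, the field order, and the gripper convention (0.0 closed, 1.0 open) are assumptions made for this example, not details taken from the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action7DoF:
    """Hypothetical container for one 7-DoF robot arm action."""
    # Position components
    x: float
    y: float
    z: float
    # Rotation components, in the order the glossary lists them
    yaw: float
    pitch: float
    roll: float
    # Gripper state (assumed convention: 0.0 = closed, 1.0 = open)
    gripper: float

    def to_vector(self) -> List[float]:
        """Flatten the action into a 7-element list."""
        return [self.x, self.y, self.z,
                self.yaw, self.pitch, self.roll,
                self.gripper]

    @classmethod
    def from_vector(cls, values: List[float]) -> "Action7DoF":
        """Rebuild an action from a 7-element list."""
        if len(values) != 7:
            raise ValueError(f"expected 7 values, got {len(values)}")
        return cls(*values)
```

A model that predicts actions would emit one such vector per control step; real systems differ in units, ordering, and whether the values are absolute poses or deltas.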