VLA: Vision-Language-Action models—systems that take images and text as input and directly output robot actions
ARM: Action Reasoning Models—the authors' proposed class of models that integrate perception, planning, and control in a structured pipeline
Visual Reasoning Trace: A 2D polyline generated by the model (or drawn by a user) on the input image, representing the planned path of the robot's end-effector
BPE: Byte-Pair Encoding, a subword tokenization method used in LLMs; here, continuous action values are discretized and represented as text tokens from the tokenizer's vocabulary
SimplerEnv: A simulation benchmark for evaluating robotic manipulation policies
LIBERO: A benchmark for lifelong robot learning, testing generalization and long-horizon task performance
Depth Perception Tokens: Discrete tokens representing quantized depth information, distilled from a specialist depth model
CoT: Chain-of-Thought—a reasoning technique where models generate intermediate steps; MolmoAct uses 'spatial' CoT (depth -> trace -> action)
SigLIP: A vision encoder model (Sigmoid Loss for Language Image Pre-training) used as a backbone
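
The BPE and Depth Perception Tokens entries both rely on turning continuous values into discrete tokens. The following is a minimal sketch of that idea using uniform binning. The bin count (256), the value range [-1, 1], and the `<act_i>` token names are illustrative assumptions, not details taken from the glossary or confirmed as the authors' method.

```python
# Hypothetical sketch: uniform binning of continuous action values into
# discrete token strings. Bin count, value range, and token naming are
# assumptions made for illustration only.

def quantize(value, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous value to an integer bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)       # clamp out-of-range inputs
    scaled = (clipped - low) / (high - low)    # normalize to [0, 1]
    return min(int(scaled * n_bins), n_bins - 1)


def dequantize(index, low=-1.0, high=1.0, n_bins=256):
    """Recover an approximate value as the center of the given bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


def action_to_tokens(action, **kw):
    """Encode each action dimension as a placeholder token string."""
    return [f"<act_{quantize(a, **kw)}>" for a in action]


def tokens_to_action(tokens, **kw):
    """Decode placeholder tokens back to approximate continuous values."""
    # "<act_" is 5 characters long and ">" is the final character.
    return [dequantize(int(t[5:-1]), **kw) for t in tokens]


if __name__ == "__main__":
    action = [-1.0, 0.0, 0.3, 1.0]
    tokens = action_to_tokens(action)
    print(tokens)
    print(tokens_to_action(tokens))
```

The round trip is lossy: each recovered value is the center of its bin, so the error is at most half a bin width. With more bins the error shrinks, at the cost of a larger token set.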