VLA: Vision-Language-Action policy, a model that takes images and a language instruction as input and directly outputs robot actions
VLM: Vision-Language Model, a neural network backbone that processes both visual and textual inputs
decision chunks: Also known as 'chunked control'. The policy predicts a sequence of several future actions at once instead of only the single next action
flow-matching: A generative modeling framework, closely related to diffusion, that learns a continuous, time-dependent vector field transporting samples from a simple noise distribution to the target data distribution
co-denoising: Simultaneously refining both the predicted action sequence and the anticipated scene representation within the same generative flow-matching process
geometric prior: A learned representation that encodes the 3D structure and state of the environment, carried forward across time steps
3DFM: 3D Foundation Model, a pre-trained neural network that extracts metric 3D features from multi-view images
RoboTwin: A simulation benchmark containing various dual-arm robot manipulation tasks
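
To make the flow-matching and decision-chunk entries concrete, the following is a minimal sketch, not the method of any particular VLA. It assumes the common straight-line probability path x_t = (1 - t) * x0 + t * x1 and replaces the learned network with the closed-form conditional velocity (x1 - x) / (1 - t) toward a known target chunk. The names `velocity`, `sample_chunk`, and `target_chunk`, as well as the chunk values, are hypothetical and chosen only for illustration.

```python
import random


def velocity(x, t, target):
    """Conditional velocity for the straight-line path x_t = (1 - t) * x0 + t * x1.

    Assumption: a trained policy would predict this field from observations.
    Here the target chunk is known, so the closed form (x1 - x) / (1 - t) is used.
    """
    return [[(x1 - xi) / (1.0 - t) for xi, x1 in zip(row, trow)]
            for row, trow in zip(x, target)]


def sample_chunk(target, steps=10, seed=0):
    """Euler-integrate the velocity field from noise (t = 0) toward t = 1."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the same (horizon, action_dim) shape as the chunk.
    x = [[rng.gauss(0.0, 1.0) for _ in row] for row in target]
    dt = 1.0 / steps
    for k in range(steps):
        t = k / steps  # t stays strictly below 1, so the division in velocity() is safe
        v = velocity(x, t, target)
        x = [[xi + dt * vi for xi, vi in zip(row, vrow)]
             for row, vrow in zip(x, v)]
    return x


# Hypothetical decision chunk: 4 future time steps of a 2-D action.
target_chunk = [[0.1, 0.0], [0.2, 0.1], [0.3, 0.2], [0.4, 0.3]]
chunk = sample_chunk(target_chunk)
```

The whole chunk is refined jointly: every integration step updates all four future actions at once, which is the sense in which a chunked policy outputs a sequence rather than a single next action. In a real system the velocity would come from a trained network conditioned on images and language, and the target would not be available at inference time.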