VLA: Vision-Language-Action models—systems that take vision and language inputs and directly output robot actions
Diffusion Policy: A robot control policy that generates actions by denoising random noise, conditioned on observations
Flow Matching: A generative modeling technique that trains a network to predict the velocity carrying a noise sample toward a data sample; used here to train the diffusion policy that generates trajectories
KV-cache: Key-Value cache, a Transformer inference optimization that stores the attention keys and values computed for earlier tokens so they are reused rather than recomputed at each decoding step
Pixel Goal Grounding: Identifying a specific 2D point in an image that corresponds to a navigational target (e.g., 'the door')
Social-VLN: A new benchmark proposed in this paper that introduces dynamic humanoid agents into VLN environments to test obstacle avoidance
DiT: Diffusion Transformer—a neural network architecture that uses Transformer blocks within a diffusion generation process
Q-Former: A module (from BLIP-2) that bridges a frozen image encoder and a language model, using a fixed set of learnable queries to extract a fixed number of visual tokens
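
The Diffusion Policy and Flow Matching entries can be made concrete with a small sketch. It assumes the common straight-line (rectified-flow style) path between a noise sample and a target trajectory; the paper's exact path, network, and observation conditioning are not given in this glossary, so every name below (`interpolate`, `velocity_target`, `flow_matching_loss`, `euler_sample`, `oracle`) is hypothetical, and the "network" is replaced by an oracle that returns the exact velocity.

```python
import random


def interpolate(noise, target, t):
    """Point on the straight-line path from noise (t=0) to target (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, target)]


def velocity_target(noise, target):
    """Regression target along the linear path: d(x_t)/dt = target - noise."""
    return [a - n for n, a in zip(noise, target)]


def flow_matching_loss(predict_velocity, noise, target, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(noise, target, t)
    v_pred = predict_velocity(x_t, t)
    v_true = velocity_target(noise, target)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, noise, steps=10):
    """Generate a sample by integrating the predicted velocity from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data: a hypothetical flattened trajectory and a Gaussian noise sample.
target = [0.5, -1.0, 2.0]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]


def oracle(x_t, t):
    # Stand-in for a trained network (assumption): returns the exact velocity.
    # A real policy would also take the observation as a conditioning input.
    return velocity_target(noise, target)


loss = flow_matching_loss(oracle, noise, target, t=0.3)
sample = euler_sample(oracle, noise, steps=10)
```

With the oracle in place the loss is zero and Euler integration lands on the target up to floating-point error; a trained network would only approximate this, and would typically be conditioned on visual and language features.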