VLA: Vision-Language-Action models—systems that take visual and text inputs and directly output robotic control actions
Non-Markovian: Processes where the next state depends on the history of events, not just the current state
PCMB: Perceptual-Cognitive Memory Bank—the proposed module storing history in two streams (visual details and semantic gist)
Working Memory: In this paper, the representation of the current timestep (perceptual + cognitive tokens) used to query long-term history
DiT: Diffusion Transformer—a diffusion model architecture based on Transformers instead of U-Nets
DDIM: Denoising Diffusion Implicit Models, a sampling method for diffusion models that needs far fewer denoising steps than standard DDPM ancestral sampling
7-DoF: 7 Degrees of Freedom—robot control outputs comprising 3 translation, 3 rotation, and 1 gripper state
SigLIP: Sigmoid Loss for Language Image Pre-training, an image-text model trained with a pairwise sigmoid loss in place of CLIP's softmax contrastive loss; used here as the visual encoder backbone
SimplerEnv: A simulation environment for evaluating robotic manipulation policies (Bridge and Fractal suites)
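
To make the 7-DoF entry above concrete, here is a minimal Python sketch that splits a flat 7-value action into its translation, rotation, and gripper parts. The names `Action7DoF` and `split_action`, the ordering of the seven values, and the sample numbers are illustrative assumptions, not taken from the paper; the actual rotation parameterization and gripper encoding may differ.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Action7DoF:
    """Hypothetical container for one 7-DoF action (layout is an assumption)."""
    translation: Tuple[float, ...]  # 3 values: end-effector translation
    rotation: Tuple[float, ...]     # 3 values: end-effector rotation
    gripper: float                  # 1 value: gripper state


def split_action(vec: Sequence[float]) -> Action7DoF:
    """Split a flat 7-value vector, assuming [translation, rotation, gripper] order."""
    if len(vec) != 7:
        raise ValueError(f"expected 7 values, got {len(vec)}")
    values = [float(v) for v in vec]
    return Action7DoF(tuple(values[0:3]), tuple(values[3:6]), values[6])


# Made-up example values, for illustration only.
action = split_action([0.01, 0.0, -0.02, 0.0, 0.1, 0.0, 1.0])
print(action)
```

Keeping the three groups as named fields rather than raw indices makes it harder to confuse a rotation component with a translation component when a policy's output is passed on to a controller.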