VLA: Vision-Language-Action models that integrate visual perception and text understanding to directly output robotic control actions
BC: Behavior Cloning—training a model to directly mimic expert actions using supervised learning
CRL: Contrastive Reinforcement Learning—estimating how likely a goal is to be reached by pulling the embeddings of a state and the goals actually reached from it together, and pushing the embeddings of randomly sampled goals apart
GCRL: Goal-Conditioned Reinforcement Learning—a framework where the agent's policy is conditioned on a goal, so that one agent learns to reach many different goals rather than maximizing a single fixed reward
Discounted State Occupancy Measure: The probability distribution over the states an agent will visit in the future under its policy, with states further in the future given geometrically smaller weight by the discount factor
MoT: Mixture-of-Transformers—a dual-system architecture where one transformer handles high-level reasoning and another handles low-level control
FAST tokenization: A tokenization method that converts continuous robotic action chunks into discrete tokens that a language model can predict auto-regressively
DiT: Diffusion Transformer—a neural architecture used here as an action expert for generating continuous robotic trajectories
Flow-matching: A generative modeling technique used to predict continuous robotic action chunks by learning a vector field that transports a simple distribution to the target data distribution
CuTe-FlashAttention: A custom GPU attention kernel built to apply structurally sparse, role-aware attention masks without a performance penalty
AR: Auto-Regressive—generating tokens one by one, where each new token depends on all previously generated tokens
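
The CRL and Discounted State Occupancy Measure entries above can be tied together with a small sketch. It assumes an InfoNCE-style contrastive loss and a truncated geometric sampler for future states; the glossary does not specify the exact loss or sampling scheme, so both are illustrative choices. The function names (`sample_future_index`, `info_nce`) and the toy embeddings are hypothetical.

```python
import math
import random


def sample_future_index(t, horizon, gamma, rng):
    """Sample a future index t' in (t, horizon - 1].

    P(t' = t + k) is proportional to gamma ** (k - 1), i.e. a geometric
    (discounted) weighting truncated at the end of the trajectory.
    Requires t < horizon - 1 so that at least one future step exists.
    """
    offsets = list(range(1, horizon - t))
    weights = [gamma ** (k - 1) for k in offsets]
    return t + rng.choices(offsets, weights=weights, k=1)[0]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(state_embs, goal_embs):
    """InfoNCE-style loss (an assumed choice, not taken from the glossary).

    Row i of goal_embs is the positive (a goal actually reached from
    state i); every other row serves as a random negative goal.
    """
    n = len(state_embs)
    total = 0.0
    for i in range(n):
        logits = [dot(state_embs[i], goal_embs[j]) for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]  # -log softmax at the positive
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    # Pick a discounted future step from a 10-step trajectory.
    print(sample_future_index(t=2, horizon=10, gamma=0.9, rng=rng))

    # Toy embeddings: aligned pairs should give a lower loss than shuffled ones.
    states = [[5.0, 0.0], [0.0, 5.0]]
    print(info_nce(states, [[5.0, 0.0], [0.0, 5.0]]))
    print(info_nce(states, [[0.0, 5.0], [5.0, 0.0]]))
```

Minimizing this loss raises the similarity between a state and the goals it actually reaches relative to random goals. In this sketch, that similarity is what stands in for the reachability estimate described in the CRL entry.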