VLA: Vision-Language-Action models—foundation models that map visual and language inputs directly to robotic actions
GRPO: Group Relative Policy Optimization, an RL algorithm that estimates advantages by normalizing each trajectory's reward within a group of sampled trajectories, which avoids a separate value network and stabilizes training
World Model: A learned predictive model that simulates the environment's dynamics (next states/frames) given current states and actions
Action Chunk: A sequence of actions predicted together and executed in succession, rather than one action at a time, to capture temporal dependencies between steps
AdaLN: Adaptive Layer Normalization, a technique that predicts the scale and shift parameters of layer normalization from conditioning inputs (such as time or action)
SDXL: Stable Diffusion XL—a large-scale text-to-image diffusion model whose VAE component is used here for high-fidelity image compression
VideoMAE: Video Masked Autoencoder—a video understanding model used here as a reward classifier to judge task success
OXE: Open X-Embodiment—a large-scale dataset of robotic trajectories used for pretraining
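
As a rough illustration of the group-relative normalization named in the GRPO entry, the sketch below standardizes a list of rewards against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` constant, and the example rewards are assumptions made for illustration, not details taken from the work this glossary belongs to.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group.

    `eps` is an assumed small constant that prevents division by zero
    when every trajectory in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary success rewards for four trajectories in one group.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful trajectories receive positive advantages and failed ones negative advantages, and a group in which all rewards are equal yields zero advantage for every trajectory.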