GST: Gaussian Spatial Tokenizer—a module that converts depth and visual features into 3D Gaussian primitives used as tokens
DA-CoT: Depth-Aware Chain-of-Thought—a supervised reasoning process where the model generates explicit spatial text (centroids, waypoints) before actions
Anisotropic Gaussian: A 3D Gaussian defined by a mean and a covariance matrix with unequal extents along its principal axes, so it can stretch differently in each direction; used here to model surface orientation
MoE: Mixture-of-Experts—a neural network architecture where different sub-networks (experts) specialize in different parts of the input space
Flow Matching: A generative modeling technique used here to predict continuous action trajectories by learning a velocity field
LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains only small low-rank matrices whose product is added to them
SE(3): Special Euclidean group in 3 dimensions—representing rigid body motions (translation + rotation)
MIP: Multi-scale Image Pyramid—aggregating features from different spatial resolutions to capture context
Proprioceptive state: The robot's internal sense of its own joint positions and gripper status
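
To make the Anisotropic Gaussian entry concrete, the sketch below builds a covariance matrix from a rotation and three per-axis scales, Sigma = R diag(s^2) R^T. This rotation-plus-scale factorization is a common parameterization for 3D Gaussian primitives and is an assumption here; the glossary does not state how GST parameterizes its Gaussians. The helper names (rotation_z, covariance) are hypothetical.

```python
import math

def rotation_z(theta):
    """3x3 rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose of a 3x3 matrix."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def covariance(rotation, scales):
    """Assumed parameterization: Sigma = R diag(s^2) R^T."""
    diag = [[scales[i] ** 2 if i == j else 0.0 for j in range(3)]
            for i in range(3)]
    return matmul(matmul(rotation, diag), transpose(rotation))

# A thin, disc-like Gaussian: wide along two axes, nearly flat along the third.
# Rotating by 90 degrees about z moves the longest axis from x onto y.
sigma = covariance(rotation_z(math.pi / 2), (3.0, 1.0, 0.1))
```

In this example the variance is largest along y and smallest along z, so the flat direction (z) plays the role of an approximate surface normal, which is one way such a primitive can carry orientation information.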