VLA: Vision-Language-Action model, a vision-language model (VLM) fine-tuned to output robot actions rather than only text.
Action Chunking: Predicting a sequence of k future actions at once rather than just the next immediate action, used to improve temporal consistency and handle latency.
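Autoregressive decoding: Generating output tokens one at a time, each conditioned on the previously generated tokens. This needs one forward pass per token, so latency grows with output length.
Parallel decoding: Generating all output tokens for a sequence simultaneously in a single forward pass, so latency is roughly independent of the number of tokens.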
FiLM: Feature-wise Linear Modulation—a technique to condition a neural network by scaling and shifting its features based on an external input (here, language instructions).
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small rank-decomposition matrices.
L1 Regression: Regression trained with the L1 loss, which minimizes the mean absolute difference between predicted and ground-truth values (Mean Absolute Error).
Diffusion Policy: A policy class that generates actions by gradually denoising random noise, often used for modeling multimodal action distributions.
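
Three of the terms above (FiLM, LoRA, and L1 regression) reduce to short formulas, so a small sketch may make them more concrete. The code below is illustrative only: it uses plain Python lists instead of a tensor library, and every function name and parameter (`film`, `lora_forward`, `l1_loss`, `alpha`, and so on) is a hypothetical choice made for this example, not an API from any particular framework. The `alpha / r` scaling and the zero-initialized `B` follow the common LoRA convention.

```python
# Illustrative sketch only: plain-Python versions of three glossary terms.
# All names here are hypothetical, chosen for this example.

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def film(features, gamma, beta):
    """FiLM: scale and shift each feature. In practice gamma and beta are
    predicted from the conditioning input (e.g. a language embedding)."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]

def lora_forward(x, W, A, B, alpha):
    """LoRA: y = W x + (alpha / r) * B (A x).
    W is frozen; only A (r x d_in) and B (d_out x r) would be trained."""
    r = len(A)  # rank of the low-rank update
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def l1_loss(pred, target):
    """L1 loss: mean absolute error between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity, for clarity)
A = [[1.0, 1.0]]               # rank r = 1

print(film(x, gamma=[2.0, 0.5], beta=[0.0, 1.0]))            # [2.0, 2.0]
print(lora_forward(x, W, A, B=[[0.0], [0.0]], alpha=2.0))    # [1.0, 2.0]
print(lora_forward(x, W, A, B=[[1.0], [0.0]], alpha=2.0))    # [7.0, 2.0]
print(l1_loss([1.0, 2.0, 3.0], [1.5, 2.0, 1.0]))
```

With `B` set to zero the LoRA output equals the frozen model's output, which is why that initialization is commonly used: fine-tuning starts from the unmodified base model and the low-rank update grows from there.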