VLA: Vision-Language-Action model—a single model that takes vision and language inputs and outputs robot actions.
SFT: Supervised Fine-Tuning—further training a pretrained model on a smaller, task-specific dataset using ground-truth labels (here, expert demonstrations).
RLOO: REINFORCE Leave-One-Out—an advantage estimation technique that scores each trajectory by its reward minus the average reward of the other trajectories sampled from the same start state.
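PPO: Proximal Policy Optimization—an RL algorithm that limits how far each update can move the policy away from the one that collected the data, typically by clipping the probability ratio, so that a single bad update does not collapse performance.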
LOOP: Leave-One-Out PPO—a framework combining RLOO advantage estimation with PPO updates to enable stable RL without a learned critic network.
critic-free: An RL approach that does not train a separate neural network (critic) to estimate value functions, simplifying the training process.
sparse binary reward: Feedback that is only given at the end of a task (success=1, failure=0), without intermediate guidance.
OpenVLA: A specific open-source Vision-Language-Action model architecture.
QueST: A lightweight VLA model architecture.
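
To make the RLOO, PPO, and LOOP entries concrete, here is a minimal sketch in plain Python. It assumes sparse binary rewards for a group of rollouts that share one start state, and it uses the standard PPO clipped surrogate. The function names (rloo_advantages, ppo_clipped_objective) and the clip_eps default of 0.2 are illustrative assumptions, not taken from any particular implementation.

```python
import math


def rloo_advantages(rewards):
    """Leave-one-out advantages for rollouts that share one start state.

    The baseline for rollout i is the mean reward of the other k - 1
    rollouts, so no learned critic is needed.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two rollouts per start state")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate (to be maximized), averaged over samples.

    clip_eps=0.2 is an assumed, commonly used default.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside the clip range in the direction the advantage favors.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


# Four rollouts from the same start state with sparse binary rewards.
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
```

In this example the successful rollouts receive an advantage of +2/3 and the failed ones -2/3. Feeding such advantages into the clipped objective is the combination that the LOOP entry describes: a PPO-style update with no value network.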