VLA: Vision-Language-Action models—foundation models that take visual and text inputs to generate robotic actions
TPO: Trajectory-wise Preference Optimization—an adaptation of Direct Preference Optimization (DPO) for robotics that aligns policies using preferences over full action trajectories
PPO: Proximal Policy Optimization—an online reinforcement learning algorithm that updates policies using a clipped objective for stability
StARe: Stage-Aware Reinforcement—the proposed module that segments trajectories and calculates stage-specific rewards/costs
SFT: Supervised Fine-Tuning—training the model to mimic expert demonstrations via behavioral cloning
IPI: Imitation→Preference→Interaction—the proposed three-stage fine-tuning pipeline (SFT → StA-TPO → StA-PPO)
Credit Assignment: The problem of determining which past actions contributed to a final outcome (reward or failure)
End-effector: The device at the end of a robotic arm, such as a gripper or hand, used to interact with the environment