GRPO: Group Relative Policy Optimization, an RL algorithm that estimates advantages by comparing the reward of each sampled output against the average reward of a group of outputs sampled for the same prompt, removing the need for a separate critic network
Inverse Dynamics: The task of inferring the action that caused a transition between two observed states (e.g., State A -> Action? -> State B)
SFT: Supervised Fine-Tuning, training a model to reproduce the reference outputs in a dataset of input-output pairs by minimizing cross-entropy loss
VLM: Vision-Language Model—a neural network capable of processing and reasoning about both images (screenshots) and text
Bounding Box: A rectangular area defined by coordinates [x1, y1, x2, y2] that encloses a GUI element
Grounding: The ability of an AI agent to identify and locate specific UI elements on a screen based on a description or goal
K-step GUI Transition: The proposed self-supervised task where the model predicts the initial action required to transition from a start screen to a screen K steps later in a recorded trajectory
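
To make the K-step GUI Transition definition concrete, the following minimal sketch shows one way such training samples could be cut from a recorded trajectory. It is an illustration under stated assumptions, not the authors' pipeline: the `Step` and `KStepSample` types, the `build_k_step_samples` helper, and the string encoding of screens and actions are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One recorded step (hypothetical layout, assumed for illustration)."""
    screen: str  # identifier or path of the screenshot seen before acting
    action: str  # serialized action taken on that screen


@dataclass
class KStepSample:
    """A self-supervised sample: two screens in, the first action out."""
    start_screen: str
    end_screen: str
    target_action: str


def build_k_step_samples(trajectory: List[Step], k: int) -> List[KStepSample]:
    """Pair each screen with the screen k steps later.

    The label is the action taken on the start screen, i.e. the initial
    action of the k-step transition.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    samples = []
    # Stop early enough that index t + k is still inside the trajectory.
    for t in range(len(trajectory) - k):
        samples.append(
            KStepSample(
                start_screen=trajectory[t].screen,
                end_screen=trajectory[t + k].screen,
                target_action=trajectory[t].action,
            )
        )
    return samples


if __name__ == "__main__":
    demo = [
        Step("screen_0.png", "click(120, 48)"),
        Step("screen_1.png", "type('hello')"),
        Step("screen_2.png", "scroll(down)"),
        Step("screen_3.png", "click(300, 410)"),
    ]
    for sample in build_k_step_samples(demo, k=2):
        print(sample)
```

With K = 1 this reduces to the ordinary inverse dynamics setup defined above. With larger K the end screen reflects several intermediate actions, but only the first one serves as the prediction target.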