Native GUI Agents: End-to-end models that map instructions and screenshots directly to executable actions without external planners
Partial Verifiability: A characteristic of GUI tasks where multiple valid actions exist for a state, but offline data verifies only one, causing ambiguity in reward assignment
GRPO: Group Relative Policy Optimization, an RL algorithm that computes each output's advantage by normalizing its reward against the other outputs sampled for the same prompt, which reduces variance without a learned value function
KL Regularization: A penalty term that keeps the trained policy close to a reference policy (usually the SFT model) to prevent mode collapse or drift
Action-aware SFT (ASFT): A fine-tuning strategy that assigns higher loss weights to action/grounding tokens and lower weights to reasoning tokens to preserve execution precision
Grounding: The ability of the model to map semantic intent (e.g., 'click the button') to precise screen coordinates
Chain-of-Thought (CoT): Intermediate reasoning steps generated by the model before the final action
Success-Adaptive Scaling: A technique to downweight the learning signal from negative samples in RL when the reward signal is unreliable (ambiguous)
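
To make the GRPO entry above concrete, here is a minimal sketch of group-relative advantage normalization. It assumes the commonly used form, in which each reward is centered on the group mean and divided by the group standard deviation. The function name `group_relative_advantages`, the `eps` stabilizer, and the choice of population standard deviation are illustrative assumptions, not details taken from this document.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same prompt.

    Assumption: advantage_i = (r_i - mean) / (std + eps), using the population
    standard deviation. Some implementations use the sample standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every output receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of four sampled outputs with binary rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every output in a group gets the same reward, all advantages are zero and the group contributes no gradient. This is one reason an ambiguous or unreliable reward signal matters for this family of methods.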