GUI: Graphical User Interface—visual interface enabling user interaction via icons and menus
VLM: Vision-Language Model—AI models capable of processing both images (screenshots) and text
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of samples to estimate advantages without a separate value network
SFT: Supervised Fine-Tuning—training models on labeled datasets before RL adaptation
ADB: Android Debug Bridge—command-line tool for communicating with Android devices
pass@k: A metric estimating the probability of solving a task at least once in k attempts
DPO: Direct Preference Optimization—an algorithm for aligning language models using preference pairs
k2-estimator: A low-variance but biased estimator of KL divergence, computed as half the squared log-probability ratio between two policies; implemented here as a Mean Squared Error (MSE) term
hallucination: Model output that states incorrect or fabricated information; a key challenge in VLM-based reward estimation
test-time training: Updating the model parameters during the evaluation phase using rewards estimated from the test inputs themselves
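
Three of the entries above (GRPO, pass@k, and the k2-estimator) describe small computations, so a minimal Python sketch may make them concrete. It assumes the common formulations: group advantages standardized by the group mean and standard deviation, the unbiased combinatorial pass@k estimate from n samples with c successes, and k2 as the mean of half the squared log-ratio. The function names, the `eps` constant, and the 1/2 factor are illustrative assumptions, not details confirmed by the source.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples (no value network).

    eps is an assumed stabilizer for groups whose rewards are all equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_at_k(n, c, k):
    """Estimate pass@k from n attempts with c successes: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def k2_estimate(logp_policy, logp_ref):
    """Mean of 0.5 * (log-ratio)^2 over sampled tokens (an MSE-style term)."""
    terms = [0.5 * (a - b) ** 2 for a, b in zip(logp_policy, logp_ref)]
    return mean(terms)


if __name__ == "__main__":
    print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
    print(pass_at_k(n=10, c=3, k=1))
    print(k2_estimate([-1.0, -2.0], [-1.5, -2.0]))
```

With a group whose rewards are all identical, `grpo_advantages` returns zeros, so that group contributes no learning signal; this is a property of the standardization itself rather than of any particular implementation.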