GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a trajectory's reward to the average reward of a group of trajectories for the same input
SFT: Supervised Fine-Tuning—training a model on labeled examples to provide a warm start before RL
AdaGRPO: Difficulty-Adaptive Group Relative Policy Optimization—the proposed algorithm extending GRPO with difficulty-aware sampling and reward shaping
AdaPR: Difficulty-Adaptive Positive Replay—a mechanism to store and reuse rare successful trajectories from hard tasks to improve sample efficiency
FCF: Failure Curriculum Filtering—a strategy to temporarily stop sampling tasks that have consistently failed, saving compute
SPA: Shortest-Path Reward Adjustment—modifying the binary success reward to favor shorter trajectories, penalizing inefficient paths
VLM: Vision Language Model—a multimodal model capable of processing both text and images (screenshots)
AVD: Android Virtual Device—a software emulator for the Android operating system
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
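
Two of the entries above (GRPO and KL divergence) describe quantities that are easy to compute by hand. The following is a minimal sketch, not the paper's implementation. The function names `group_relative_advantages` and `kl_divergence` are hypothetical. Dividing by the group's standard deviation follows the commonly published GRPO formulation and is an assumption here, since the glossary only mentions comparison to the group average. The small `eps` term is also an assumption, added to avoid division by zero.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each trajectory relative to its group (same input).

    Assumption: advantages are mean-centred and scaled by the group's
    population standard deviation; eps guards against a zero std.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# One success among four attempts at the same task (binary rewards).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Divergence of a policy's action distribution from a reference distribution.
divergence = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

In this sketch the single successful trajectory receives a positive advantage and the failures receive negative ones. When every trajectory in a group has the same reward, all advantages are zero, so a task the policy always fails provides no learning signal.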