VLM: Vision-Language Model—a model that processes both image and text inputs to generate text or actions
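AWR: Advantage-Weighted Regression, an off-policy reinforcement learning (RL) algorithm that updates the policy by regressing onto previously collected actions, weighting each action by its exponentiated estimated advantage so that high-advantage actions dominate the update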
AitW: Android-in-the-Wild—a dataset and benchmark for Android device control tasks
Doubly-Robust Estimator: A statistical technique for estimating advantages that combines Monte-Carlo returns (low bias, high variance) with value function estimates (high bias, low variance) to reduce overall error
Instruction-level Value Function: A learned function that predicts the expected success rate of a specific instruction, used to prioritize harder or more informative tasks (curriculum learning)
Step-level Value Function: A learned function that predicts the expected future reward from a specific state, used to compute advantages for specific actions
SFT: Supervised Fine-Tuning—training a model on a fixed dataset of expert demonstrations
GUI: Graphical User Interface—the visual interface of a device (icons, buttons, windows) that the agent interacts with
Filtered Behavior Cloning: An imitation learning approach where the agent clones only the successful trajectories from its own past experiences
CogAgent: A large vision-language model specifically designed for GUI agents
AppAgent: A prior agent framework that uses large language models (LLMs) or VLMs such as GPT-4V to control apps through a simplified action space
Gemini 1.5 Pro: A large proprietary multimodal model by Google
Auto-Curriculum: A mechanism to automatically select which tasks to train on based on their estimated learning value or difficulty
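Illustrative sketch: The Doubly-Robust Estimator and AWR entries above can be made concrete with a small Python sketch. It is a simplified illustration, not the exact formulation behind these terms: the fixed mixing weight `mix`, the temperature `beta`, the clipping bound `max_weight`, and both function names are assumptions introduced here.

```python
import math


def doubly_robust_advantage(rewards, values, t, gamma=0.99, mix=0.5):
    """Blend a Monte-Carlo advantage with a one-step value-based advantage.

    rewards[i] is the reward received at step i.
    values[i] is the step-level value estimate V(s_i); values must have
    one more entry than rewards (use 0.0 for a terminal state).
    mix is a hypothetical weight in [0, 1]: 1.0 is pure Monte-Carlo,
    0.0 is pure one-step value bootstrapping.
    """
    # Monte-Carlo estimate: low bias, high variance.
    mc_return = sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))
    mc_adv = mc_return - values[t]
    # One-step estimate from the value function: higher bias, lower variance.
    td_adv = rewards[t] + gamma * values[t + 1] - values[t]
    return mix * mc_adv + (1.0 - mix) * td_adv


def awr_weight(advantage, beta=1.0, max_weight=20.0):
    """AWR-style regression weight: exponentiated advantage, clipped for stability."""
    return min(math.exp(advantage / beta), max_weight)


# Sparse-reward episode: success is only signalled at the final step.
rewards = [0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.7, 0.0]
adv = doubly_robust_advantage(rewards, values, t=0, gamma=1.0, mix=0.5)
weight = awr_weight(adv)
```

In an AWR-style update, `weight` would multiply the log-likelihood of the action taken at step `t`, so actions with positive estimated advantage are imitated more strongly than those with negative advantage.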