MBRL: Model-Based Reinforcement Learning—an RL approach where the agent learns a model of the environment's dynamics and uses it to simulate experience
World Model: A learned simulator that predicts the next state of the environment given the current state and an action
Imagined Rollout: A trajectory of states and actions generated entirely by interacting with the world model rather than the real environment
Accessibility Tree: A structured, text-based representation of a web page's user interface elements, used as the observation space for the agent
GSPO: Group Sequence Policy Optimization—an RL algorithm that updates the policy based on the likelihood of entire trajectories (sequences) rather than individual tokens
SFT: Supervised Fine-Tuning—training a model on a dataset of expert demonstrations before applying reinforcement learning
POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment
Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before producing the final answer or action
WebArena: A benchmark for evaluating web agents on realistic, self-hosted websites
WebVoyager: A benchmark for evaluating web agents on live websites via multimodal interaction