GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled trajectories rather than using a learned value function
POMDP: Partially Observable Markov Decision Process—a decision-making framework where the agent cannot see the entire state of the environment (e.g., hidden backend state of a website)
Behavior Cloning (BC): A supervised learning approach where an agent learns to mimic expert actions from a dataset of demonstrations
M-GRPO: Multi-turn Group Relative Policy Optimization—the paper's extension of GRPO to handle sequential decisions over multiple turns in an environment
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
Context Compression: Shortening past inputs (e.g., HTML pages) in the prompt to reduce context length and memory use while retaining essential history
SFT: Supervised Fine-Tuning—training a model on labeled examples (here, expert trajectories) before applying RL
WebArena-Lite: A curated, human-verified subset of the WebArena benchmark, designed for more reliable evaluation of web agents
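
To make the GRPO entry concrete, here is a minimal sketch of group-relative advantage normalization: each sampled trajectory's reward is centered and scaled by the statistics of its own group, so no learned value function is needed. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the sample rewards are illustrative assumptions, not details taken from the paper, and M-GRPO's multi-turn handling is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward against its own sampling group.

    `rewards` holds one scalar reward per trajectory sampled for the same task.
    `eps` (an assumed stabilizer) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four trajectories: two succeeded (1.0), two failed (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # roughly [1, -1, -1, 1]
```

Successful trajectories receive a positive advantage and failed ones a negative advantage. When every trajectory in a group earns the same reward, all advantages are zero and that group contributes no learning signal.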