VeriEnv: The proposed framework that clones websites into sandboxed environments to train agents with verifiable rewards
Playwright: A browser testing and automation framework that lets code control a browser programmatically
MCP: Model Context Protocol, an open standard that lets AI models interact with external tools and data sources
Rejection Fine-Tuning: A training method where the model generates multiple trajectories, and only those that pass a verification check are used for supervised fine-tuning
SDK: Software Development Kit; in this paper, a generated Python interface that allows the validator to query the synthetic website's database
WebArena: A realistic web agent benchmark environment requiring agents to perform tasks across simulated websites
Mind2Web: A dataset and benchmark for developing generalist web agents across many different domains
LLM-as-a-Judge: Using a Large Language Model to evaluate the output or behavior of another model, often used when ground truth is hard to define programmatically
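
The Rejection Fine-Tuning entry above describes a generate-then-filter loop, which can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `Trajectory`, `rejection_filter`, and `toy_verifier` are hypothetical names, and the toy check on the final action merely stands in for a real validator (in VeriEnv, one that would query the synthetic website's database through the generated SDK).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Trajectory:
    """One sampled attempt at a task (hypothetical structure)."""
    task_id: str
    actions: List[str]


def rejection_filter(
    trajectories: Iterable[Trajectory],
    verify: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Keep only the trajectories that pass the verification check."""
    return [t for t in trajectories if verify(t)]


def toy_verifier(traj: Trajectory) -> bool:
    """Assumed stand-in for a programmatic validator: the attempt
    counts as successful only if it ends by submitting the order."""
    return bool(traj.actions) and traj.actions[-1] == "submit_order"


# Several sampled trajectories for the same task; only verified ones survive.
samples = [
    Trajectory("buy-item", ["search", "add_to_cart", "submit_order"]),
    Trajectory("buy-item", ["search", "add_to_cart"]),
    Trajectory("buy-item", ["add_to_cart", "submit_order"]),
]
sft_data = rejection_filter(samples, toy_verifier)
```

Only the trajectories in `sft_data` would then be passed on to supervised fine-tuning; the rejected attempt is simply discarded.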