Safe and Scalable Web Agent Learning via Recreated Websites

📝 Paper Summary

Web agents, Synthetic environment generation, Agentic reinforcement learning

VeriEnv uses coding agents to clone real websites into executable sandboxes, enabling web agents to learn from self-generated tasks with deterministic, database-backed verification instead of unreliable LLM judges.

Core Problem

Training web agents on real websites is unsafe, hard to reset, and lacks reliable rewards, often forcing reliance on error-prone LLM-as-a-judge evaluations.

Why it matters:

Real-world exploration is dangerous (spamming, payments) or blocked (CAPTCHAs), limiting the scale of agent training data
LLM-based evaluation is heuristic and unreliable, leading to unstable learning signals compared to deterministic code-based verification
Existing benchmarks are static and finite; agents need a scalable way to self-evolve on diverse, continuously expanding tasks

Concrete Example: A task asking to 'sort apartments by price' might be judged correct by an LLM if the visual output looks sorted, even if the underlying logic failed. In VeriEnv, a Python script queries the simulated database and deterministically verifies the sort order.
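A minimal sketch of the kind of check this example describes, assuming a hypothetical `apartments` table with `id` and `price` columns in a SQLite database. The schema and the `is_sorted_by_price` helper are illustrative only and are not taken from the paper, whose validators read state through a generated SDK.

```python
import sqlite3

def is_sorted_by_price(conn, displayed_ids):
    """Return True if the ids, in the order shown on the page, are ascending by price.

    `displayed_ids` is a hypothetical input: the listing order the agent produced.
    Prices are looked up in the database rather than read from the rendered page.
    """
    if not displayed_ids:
        return True
    placeholders = ",".join("?" for _ in displayed_ids)
    rows = conn.execute(
        f"SELECT id, price FROM apartments WHERE id IN ({placeholders})",
        list(displayed_ids),
    ).fetchall()
    price_by_id = dict(rows)
    # An id that is missing from the database cannot be verified.
    if any(i not in price_by_id for i in displayed_ids):
        return False
    prices = [price_by_id[i] for i in displayed_ids]
    return all(a <= b for a, b in zip(prices, prices[1:]))

# Tiny in-memory database standing in for the cloned site's backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apartments (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany(
    "INSERT INTO apartments VALUES (?, ?)",
    [(1, 1500.0), (2, 900.0), (3, 1200.0)],
)

sorted_ok = is_sorted_by_price(conn, [2, 3, 1])   # prices 900, 1200, 1500
sorted_bad = is_sorted_by_price(conn, [1, 2, 3])  # prices 1500, 900, 1200
```

The check depends only on stored prices, so the same final state always yields the same verdict.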

Key Novelty

VeriEnv (Verifiable Environments via Website Cloning)

Treats the environment itself as code: A coding agent (GPT-5.2) clones a target website's frontend, backend, and database from screenshots into a locally executable application
Generates tasks paired with executable validation programs (using a custom Python SDK) that inspect the internal database state for deterministic pass/fail rewards

Architecture

The complete VeriEnv framework workflow: (1) Cloning real websites into synthetic environments, (2) Generating tasks with executable validators, and (3) Training agents via verifiable feedback.

Evaluation Highlights

+9.09% success rate improvement on WebArena-Lite using LLaMA-3.2-3B-Instruct trained with VeriEnv compared to the base model
+6.06% success rate improvement on WebArena-Lite using Qwen3-4B trained with VeriEnv compared to the base model
In human evaluation, the synthetically recreated websites received a functionality rating of 90.3% and a visual quality score of 4.7/5

Breakthrough Assessment

8/10

Significant step forward in agent safety and scalability. Moving from 'simulators' to 'automatically cloned real-world replicas' with verifiable internals addresses a major bottleneck in web agent RL.

⚙️ Technical Details

Problem Definition

Setting: Web agent training via self-evolution in synthetic environments

Inputs: Screenshots of a real-world website E

Outputs: An executable synthetic environment E_tilde = (Code, Database, PythonSDK) and a trained web agent policy

Pipeline Flow

Environment Construction: Coding Agent + Screenshots -> Synthetic Website (Code + DB + SDK)
Task Generation: LLM -> Task Description + Python Validation Program
Agent Training: Agent -> Trajectory -> Python Validator (via SDK) -> Deterministic Reward
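The three stages above can be sketched end to end as a toy loop. This is an illustration under stated assumptions, not the paper's implementation: the `Task` dataclass, the dictionary standing in for the database, the `apply_action` dynamics, and the scripted `toy_agent` are all hypothetical.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Database = Dict[str, list]   # toy stand-in for the cloned site's database
Trajectory = List[str]       # sequence of action strings

@dataclass
class Task:
    description: str
    validator: Callable[[Database], bool]  # executable check on the final DB state

def apply_action(db: Database, action: str) -> None:
    """Toy environment dynamics: 'add_to_cart:<item>' appends to the cart table."""
    kind, _, arg = action.partition(":")
    if kind == "add_to_cart":
        db["cart"].append(arg)

def rollout(task: Task, agent: Callable[[str], Trajectory],
            initial_db: Database) -> Tuple[Trajectory, int]:
    db = copy.deepcopy(initial_db)            # fresh environment state per episode
    trajectory = agent(task.description)
    for action in trajectory:
        apply_action(db, action)
    reward = 1 if task.validator(db) else 0   # deterministic pass/fail
    return trajectory, reward

def toy_agent(description: str) -> Trajectory:
    # Scripted placeholder for a learned policy.
    return ["add_to_cart:lamp"] if "lamp" in description else ["noop"]

initial_db: Database = {"cart": []}
tasks = [
    Task("Add a lamp to the cart", lambda db: db["cart"] == ["lamp"]),
    Task("Add a chair to the cart", lambda db: db["cart"] == ["chair"]),
]
results = [rollout(t, toy_agent, initial_db) for t in tasks]
rewards = [reward for _, reward in results]
```

Copying the initial state before every episode keeps one rollout from contaminating the next, which matters because the validator inspects state rather than the agent's self-report.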

System Modules

Environment Creator

Clones the target website into a functional local version

Model or implementation: GPT-5.2 (OpenAI)

Task & Judge Generator

Creates training tasks and their corresponding executable verifiers

Model or implementation: GPT-5.2

Web Agent

Performs the web task within the synthetic environment

Model or implementation: Qwen3-4B or LLaMA-3.2-3B-Instruct

Novel Architectural Elements

Automatic generation of a Python SDK (P) alongside the website application code to expose internal database states for external verification
Iterative self-debugging loop where the coding agent uses Playwright to visually inspect the cloned site and patch bugs before agent training begins
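To make the SDK idea concrete, here is a hedged sketch of what a generated read-only interface and a validator built on it might look like. The `SiteSDK` class, its methods, the `users`/`orders` schema, and `validate_order_placed` are hypothetical names invented for illustration; the paper's generated SDKs are site-specific and may differ substantially.

```python
import sqlite3
from typing import List, Optional

class SiteSDK:
    """Hypothetical read-only view over a cloned site's SQLite database."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        self._conn.row_factory = sqlite3.Row  # rows become name-addressable

    def get_user(self, username: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT id, username FROM users WHERE username = ?", (username,)
        ).fetchone()
        return dict(row) if row else None

    def list_orders(self, user_id: int) -> List[dict]:
        rows = self._conn.execute(
            "SELECT id, item, quantity FROM orders WHERE user_id = ? ORDER BY id",
            (user_id,),
        ).fetchall()
        return [dict(r) for r in rows]

def validate_order_placed(sdk: SiteSDK, username: str, item: str, quantity: int) -> bool:
    """Task validator: did this user end up with an order for `quantity` of `item`?"""
    user = sdk.get_user(username)
    if user is None:
        return False
    return any(
        o["item"] == item and o["quantity"] == quantity
        for o in sdk.list_orders(user["id"])
    )

# In-memory database standing in for the state left behind by an agent episode.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT, quantity INTEGER)"
)
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO orders VALUES (1, 1, 'lamp', 2)")

sdk = SiteSDK(conn)
passed = validate_order_placed(sdk, "alice", "lamp", 2)
```

Keeping the SDK read-only separates verification from the application code path the agent exercises, so a validator cannot accidentally alter the state it is judging.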

Modeling

Base Model: Qwen3-4B and LLaMA-3.2-3B-Instruct (for the Web Agents); GPT-5.2 (for Environment Creation)

Training Method: Rejection Fine-Tuning (RFT)

Objective Functions:

Purpose: Filter agent trajectories for quality.

Formally: Verify trajectory tau using executable validator V(tau, D) -> {0, 1}.

Purpose: Fine-tune policy on successful trajectories.

Formally: Standard Supervised Fine-Tuning (SFT) loss on filtered data.
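One way to write these two steps formally is shown below. The notation is assumed here rather than taken from the paper: \pi_\theta is the agent policy, \pi_{\theta_0} the policy used for sampling, x a task description, \tau = (s_1, a_1, \dots, s_T, a_T) a trajectory of states and actions, and D the database state inspected by the validator V.

```latex
\mathcal{D}_{\mathrm{RFT}}
  = \bigl\{\, (x, \tau) \;:\; \tau \sim \pi_{\theta_0}(\cdot \mid x),\;\; V(\tau, D) = 1 \,\bigr\}

\mathcal{L}_{\mathrm{SFT}}(\theta)
  = - \, \mathbb{E}_{(x, \tau) \in \mathcal{D}_{\mathrm{RFT}}}
    \left[ \sum_{t=1}^{T} \log \pi_\theta\bigl(a_t \mid x,\, s_{\le t},\, a_{<t}\bigr) \right]
```

The first line is the rejection step (keep only trajectories the validator accepts); the second is the usual token-level negative log-likelihood applied to the actions of the kept trajectories.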

Adaptation: Full fine-tuning

Training Data:

149 recreated websites
7,400 generated tasks
Trajectories filtered via executable verification

Key Hyperparameters:

learning_rate: 1e-5
epochs: 2
lr_scheduler: linear warmup (first 10%)
max_sequence_length: 8000
gradient_accumulation_steps: 2
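For concreteness, the listed schedule could be computed as below. The summary only states a linear warmup over the first 10% of training with a peak rate of 1e-5; holding the rate constant after warmup is an assumption made here (a decay phase is equally plausible), and `total_steps = 1000` is a hypothetical value.

```python
def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 1e-5, warmup_frac: float = 0.10) -> float:
    """Linear warmup to `peak_lr` over the first `warmup_frac` of steps.

    Assumption: the rate stays at `peak_lr` after warmup; the source does not
    say what happens once warmup ends.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

total_steps = 1000  # hypothetical number of optimizer steps
schedule = [learning_rate(s, total_steps) for s in range(total_steps)]
```

With `gradient_accumulation_steps: 2`, one optimizer step corresponds to two forward/backward passes, so `step` here counts optimizer updates rather than micro-batches.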

Compute: 2x NVIDIA A40 GPUs for agent training. Website cloning takes ~83.5 mins per site.

Comparison to Prior Work

vs. PAE: VeriEnv uses deterministic code-based verification (SDK checking DB) rather than unreliable LLM judges [cited in paper]
vs. Synatra/ADP: VeriEnv trains on fully interactive, self-generated trajectories in cloned environments rather than static datasets or tutorial following [cited in paper]
vs. WebArena/Mind2Web (Benchmarks): VeriEnv is a *generator* of environments, not just a static set of tasks [cited in paper]
vs. WebVoyager [not cited in paper]: WebVoyager interacts with live, real-world sites; VeriEnv clones the site to allow safe, deep-state verification

Limitations

Dependency on proprietary heavy-duty models (GPT-5.2) for the environment creation step
Difficulty cloning complex multimedia sites (e.g., YouTube video streams, PDF rendering on arXiv)
Infrastructure challenges (e.g., port conflicts) when deploying hundreds of generated apps simultaneously
Judge correctness (76%) is lower than task executability (90%), primarily due to database reset issues

Reproducibility

Code: https://github.com/kyle8581/VeriEnv

Code and resources to be released at https://github.com/kyle8581/VeriEnv. Coding agent prompts provided in Appendix. Implementation relies on GPT-5.2 (proprietary) and Cursor CLI.

📊 Experiments & Results

Evaluation Setup

Cross-domain generalization (training on recreated sites, testing on established benchmarks) and site-specific mastery.

Benchmarks:

WebArena-Lite: realistic web tasks (hosting, shopping, etc.)
Mind2Web-Online: generalizable web interaction across 100+ sites

Metrics:

Success Rate (SR)
Statistical methodology: Not explicitly reported in the paper

Key Results

VeriEnv improves success rates on the WebArena-Lite benchmark across different base models, outperforming baselines that use aggregated data (ADP) or static datasets.

Benchmark	Metric	Baseline	This Paper	Δ
WebArena-Lite (LLaMA-3.2-3B-Instruct)	Success Rate (%)	12.12	21.21	+9.09
WebArena-Lite (Qwen3-4B)	Success Rate (%)	19.70	25.76	+6.06
Mind2Web-Online	Success Rate (%)	19.50	20.90	+1.40
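The Δ column is an absolute difference in percentage points, which can be checked directly from the table. The short script below also derives the relative improvement over each baseline; the relative figures are computed here for context and are not reported in the source.

```python
# (benchmark, baseline success rate, VeriEnv success rate), copied from the table.
results = [
    ("WebArena-Lite", 12.12, 21.21),
    ("WebArena-Lite", 19.70, 25.76),
    ("Mind2Web-Online", 19.50, 20.90),
]

summary = []
for benchmark, baseline, ours in results:
    delta = round(ours - baseline, 2)                        # percentage points
    relative = round(100 * (ours - baseline) / baseline, 1)  # percent of baseline
    summary.append((benchmark, delta, relative))
    print(f"{benchmark}: +{delta:.2f} points ({relative:.1f}% relative)")
```

The gap between absolute and relative gains is largest for the weakest baseline, so the two rows for WebArena-Lite are not directly comparable by Δ alone.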

Experiment Figures

Comparison of site-specific mastery (Success Rate) over training iterations between VeriEnv (Verifiable) and PAE (LLM-Judge) on three website categories.

Scaling trends: Agent Success Rate vs. Number of Training Environments.

Main Takeaways

Agents trained with VeriEnv generalize effectively to unseen websites (WebArena), surpassing models trained on static datasets.
Verifiable rewards (code-based judges) lead to more stable self-evolving learning compared to LLM-based judges (PAE), which plateau due to false positives.
Scaling the number of recreated training environments linearly improves agent performance, suggesting a scalable path forward.
Coding agents can successfully clone ~80% of target websites with high functional fidelity, enabling 'Environment-as-Code' for training.

📚 Prerequisite Knowledge

Prerequisites

Web development stacks (Frontend/Backend/Database)
Reinforcement Learning (Reward signals, Trajectories)
Language Models as Agents (Tool use, Coding)

Key Terms

VeriEnv: The proposed framework that clones websites into sandboxed environments to train agents with verifiable rewards

Playwright: A framework for Web Testing and Automation that allows code to control a browser programmatically

MCP: Model Context Protocol—a standard way for AI models to interact with external tools and data contexts

Rejection Fine-Tuning: A training method where the model generates multiple trajectories, and only those that pass a verification check are used for supervised fine-tuning

SDK: Software Development Kit—in this paper, a generated Python interface that allows the validator to query the synthetic website's database

WebArena: A realistic web agent benchmark environment requiring agents to perform tasks across simulated websites

Mind2Web: A dataset and benchmark for developing generalist web agents across many different domains

LLM-as-a-Judge: Using a Large Language Model to evaluate the output or behavior of another model, often used when ground truth is hard to define programmatically