AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

📝 Paper Summary

Synthetic Data Generation, GUI Agents, Web Navigation

AgentTrek synthesizes high-quality training data for web agents by harvesting text tutorials from the internet and using a VLM to execute them in real browsers, capturing grounded trajectories at low cost.

Core Problem

Training effective GUI agents requires large-scale, high-quality trajectory data (sequences of goals, observations, reasoning, and actions), which is scarce on the web.

Why it matters:

Standard LLM training data lacks the step-by-step visual and interactive grounding needed for complex web tasks
Existing datasets rely on human annotation, which is slow, unscalable, and prohibitively expensive for large datasets
Without diverse trajectory data, agents struggle with long-horizon planning and precise element interaction

Concrete Example: A human-annotated dataset might cost tens of dollars per task. AgentTrek generates a trajectory for finding a specific return policy by having GPT-4o read a 'How to find return policy' tutorial and actually click through the live website, recording the data for $0.55.

Key Novelty

Guided Replay of Web Tutorials

Instead of random exploration or human demonstration, the system harvests existing 'how-to' tutorials from the web to serve as high-level scripts
A capable VLM agent 'replays' these tutorials in a live browser, effectively converting static text instructions into dynamic, grounded interaction traces (screenshots, DOM trees, actions)
Self-correction and evaluation are built-in: a separate VLM evaluator verifies if the agent's replay successfully achieved the tutorial's goal before adding it to the dataset

Architecture

The complete AgentTrek pipeline from data collection to model training.

Evaluation Highlights

Achieves a cost of $0.551 per high-quality trajectory, significantly cheaper than human annotation
+9.3 point task success rate on WebArena (22.40% vs 13.10%) for Qwen2.5-32B trained on AgentTrek data, compared with the GPT-4o baseline
+36.7 point improvement on ScreenSpot average accuracy (67.4% vs 30.7%) for Qwen2-VL-7B fine-tuned on AgentTrek data

Breakthrough Assessment

8/10

Provides a scalable, economically viable solution to the data bottleneck for GUI agents. The method leverages existing web knowledge (tutorials) to generate grounded action data, yielding large gains over the reported baselines on WebArena, ScreenSpot, and Multimodal-Mind2Web.

⚙️ Technical Details

Problem Definition

Setting: Generating a dataset of trajectories T = {(d, o_1, r_1, a_1, ...)} where d is the task description, o are observations, r are reasoning steps, and a are actions.

Inputs: Raw web corpus (RedPajama)

Outputs: Verified agent trajectories containing screenshots, accessibility trees, and action sequences
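
The trajectory tuple in the Setting line can be pictured as a small record type. The following is only an illustrative sketch: the class and field names (Step, Trajectory, screenshot_path, axtree) are assumptions made for this example, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One (o_i, r_i, a_i) triple: observation, reasoning, action."""
    screenshot_path: Optional[str]  # visual observation (hypothetical field)
    axtree: Optional[str]           # text observation (hypothetical field)
    reasoning: str                  # r_i
    action: str                     # a_i, e.g. "click('1')"


@dataclass
class Trajectory:
    """T = (d, o_1, r_1, a_1, ...): a task description plus ordered steps."""
    description: str                # d
    steps: List[Step] = field(default_factory=list)

    def actions(self) -> List[str]:
        """Return the action sequence a_1, a_2, ... in order."""
        return [s.action for s in self.steps]


# Toy trajectory with two text-only steps.
traj = Trajectory(description="Find the store's return policy")
traj.steps.append(Step(None, "[1] link 'Help'", "Help likely links to policies.", "click('1')"))
traj.steps.append(Step(None, "[7] link 'Returns'", "Returns page holds the policy.", "click('7')"))
print(traj.actions())
```

Keeping both observation fields optional reflects that the summary describes separate vision-based (screenshot) and text-based (AXTree) training sets.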

Pipeline Flow

Group 1: Automatic Tutorial Collection (Raw Text → Structured Tutorial)
Group 2: Guided Replay (Structured Tutorial → Raw Trajectory)
Group 3: Evaluation & Filtering (Raw Trajectory → Verified Dataset)
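
The three groups compose into a simple filter-map-filter pipeline. The sketch below shows that control flow only; run_pipeline and the three callables are hypothetical names, and the toy lambdas stand in for the real classifier, replay agent, and evaluator.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Tutorial = TypeVar("Tutorial")
Traj = TypeVar("Traj")


def run_pipeline(
    raw_pages: Iterable[str],
    collect: Callable[[str], Optional[Tutorial]],  # Group 1: raw text -> tutorial
    replay: Callable[[Tutorial], Traj],            # Group 2: tutorial -> raw trajectory
    evaluate: Callable[[Tutorial, Traj], bool],    # Group 3: keep or discard
) -> List[Traj]:
    """Return only the trajectories that pass evaluation."""
    verified = []
    for page in raw_pages:
        tutorial = collect(page)
        if tutorial is None:       # page was not a usable tutorial
            continue
        trajectory = replay(tutorial)
        if evaluate(tutorial, trajectory):
            verified.append(trajectory)
    return verified


# Toy stand-ins (assumptions for illustration only).
pages = ["How to reset a password: 1. Open settings", "Celebrity gossip roundup"]
result = run_pipeline(
    pages,
    collect=lambda p: p if p.lower().startswith("how to") else None,
    replay=lambda t: {"task": t, "actions": ["click", "type"]},
    evaluate=lambda t, tr: len(tr["actions"]) > 0,
)
print(len(result))
```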

System Modules

Pre-filter & Labeler (Automatic Tutorial Collection)

Identify potential tutorial texts from massive web corpora

Model or implementation: FastText classifier trained on GPT-4o-mini labeled data
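
As a rough illustration of how LLM-labeled examples could be prepared for a FastText classifier, the sketch below formats (label, text) pairs in fastText's supervised-training line format, where each line starts with a `__label__` prefix. The label names and sample texts are assumptions; the paper's actual labels and training settings are not given in this summary.

```python
import re
from typing import Iterable, List, Tuple


def to_fasttext_line(label: str, text: str) -> str:
    """Format one example as '__label__<label> <text>' on a single line."""
    flat = re.sub(r"\s+", " ", text).strip()  # one example per line, so collapse newlines
    return f"__label__{label} {flat}"


def build_training_lines(examples: Iterable[Tuple[str, str]]) -> List[str]:
    """Convert (label, text) pairs into fastText training lines."""
    return [to_fasttext_line(label, text) for label, text in examples]


# Hypothetical LLM-assigned labels; "tutorial" and "other" are assumed names.
labeled = [
    ("tutorial", "How to export a report:\n1. Open Reports\n2. Click Export"),
    ("other", "Our company was founded in 1998."),
]
lines = build_training_lines(labeled)
for line in lines:
    print(line)
```

These lines could then be written to a file and passed to `fasttext.train_supervised(input=path)`; that step is left out so the sketch runs with the standard library alone.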

Standardizer (Automatic Tutorial Collection)

Convert raw tutorial text into a structured JSON format with steps and prerequisites

Model or implementation: GPT-4o-mini
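
The summary does not give the exact JSON schema, so the following is a minimal sketch under assumed key names (task_description, prerequisites, steps). It shows how a standardizer output might be parsed and checked before being handed to the replay stage.

```python
import json
from typing import Any, Dict

# Hypothetical schema: these key names are assumptions, not the paper's format.
REQUIRED_KEYS = ("task_description", "prerequisites", "steps")


def parse_tutorial(raw_json: str) -> Dict[str, Any]:
    """Parse a standardizer output and check it has the expected shape."""
    data = json.loads(raw_json)
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not isinstance(data["steps"], list) or not data["steps"]:
        raise ValueError("steps must be a non-empty list")
    return data


example = json.dumps({
    "task_description": "Find the return policy on an online store",
    "prerequisites": ["A web browser"],
    "steps": ["Open the store homepage", "Scroll to the footer", "Click 'Returns'"],
})
tutorial = parse_tutorial(example)
print(tutorial["steps"])
```

Rejecting malformed outputs early keeps the more expensive replay stage from running on tutorials that have no actionable steps.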

Replay Agent

Execute the structured tutorial on the live web to generate observations and actions

Model or implementation: GPT-4o (VLM agent)
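
A guided replay loop can be sketched as an observe-decide-act cycle in which the tutorial is passed to the policy as guidance at every step. The function and parameter names here (guided_replay, observe, policy, act, max_steps) are hypothetical; in a real setup, observe and act would wrap a browser automation library and policy would call the VLM.

```python
from typing import Callable, Dict, List, Tuple


def guided_replay(
    tutorial_steps: List[str],
    observe: Callable[[], str],
    policy: Callable[[str, str, List[str]], Tuple[str, str]],
    act: Callable[[str], None],
    max_steps: int = 20,
) -> List[Dict[str, str]]:
    """Roll out the policy, recording observation, reasoning, and action per step."""
    guide = "\n".join(tutorial_steps)
    history: List[str] = []
    trace: List[Dict[str, str]] = []
    for _ in range(max_steps):
        obs = observe()
        reasoning, action = policy(guide, obs, history)
        trace.append({"observation": obs, "reasoning": reasoning, "action": action})
        if action == "stop":   # the policy signals that the task is finished
            break
        act(action)
        history.append(action)
    return trace


# Toy environment: three "pages" visited in order (illustration only).
pages = ["home", "footer", "returns"]
state = {"i": 0}


def observe() -> str:
    return pages[state["i"]]


def act(action: str) -> None:
    state["i"] = min(state["i"] + 1, len(pages) - 1)


def policy(guide: str, obs: str, history: List[str]) -> Tuple[str, str]:
    if obs == "returns":
        return "Policy page reached.", "stop"
    return f"On {obs}; follow the next tutorial step.", "click"


trace = guided_replay(["Open homepage", "Click Returns"], observe, policy, act)
print([step["action"] for step in trace])
```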

Evaluator

Verify if the generated trajectory successfully achieved the goal

Model or implementation: GPT-4o

Novel Architectural Elements

Guided Replay Pipeline: A pipeline in which harvested, standardized tutorials guide a VLM agent to generate grounded interaction data, which a VLM evaluator then verifies before it enters the dataset

Modeling

Base Model: Qwen2.5-7B/32B (Text-based) and Qwen2-VL-7B (Vision-based)

Training Method: Supervised Fine-Tuning (SFT)

Training Data:

10,000 trajectories for vision-based agents (Screenshots + Actions)
6,000 trajectories for text-based agents (AXTree + Actions)
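
For supervised fine-tuning, a trajectory is typically unrolled into one training example per step. The sketch below shows one plausible way to do this for a text-based agent; the prompt layout, the "Thought:"/"Action:" format, and the function name to_sft_examples are assumptions, since the summary does not specify the paper's data format.

```python
from typing import Dict, List


def to_sft_examples(task: str, steps: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Emit one (prompt, completion) pair per step, with prior actions as history."""
    examples = []
    history: List[str] = []
    for step in steps:
        previous = "; ".join(history) if history else "none"
        prompt = (
            f"Task: {task}\n"
            f"Previous actions: {previous}\n"
            f"Observation: {step['observation']}"
        )
        completion = f"Thought: {step['reasoning']}\nAction: {step['action']}"
        examples.append({"prompt": prompt, "completion": completion})
        history.append(step["action"])
    return examples


steps = [
    {"observation": "[1] link 'Help'", "reasoning": "Help may list policies.", "action": "click('1')"},
    {"observation": "[7] link 'Returns'", "reasoning": "This opens the policy.", "action": "click('7')"},
]
examples = to_sft_examples("Find the return policy", steps)
print(examples[1]["prompt"])
```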

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Not reported in the paper

Compute: Cost of data generation: $219.35 per 1,000 trajectories
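
The summary quotes both $219.35 per 1,000 trajectories and $0.551 per high-quality trajectory. One possible reading, which is an assumption made here and not stated in the summary, is that the first figure covers all generated trajectories while the second covers only those that pass evaluation. Under that assumption, the arithmetic below derives the implied pass rate.

```python
COST_PER_1000_GENERATED = 219.35  # USD, figure quoted in the summary
COST_PER_VERIFIED = 0.551         # USD, figure quoted in the summary

cost_per_generated = COST_PER_1000_GENERATED / 1000

# Assumption: the gap between the two figures is explained by evaluator filtering.
implied_pass_rate = cost_per_generated / COST_PER_VERIFIED

# Budget for a 10,000-trajectory training set at the verified-trajectory price.
cost_for_10k_verified = 10_000 * COST_PER_VERIFIED

print(f"per generated trajectory: ${cost_per_generated:.3f}")
print(f"implied pass rate: {implied_pass_rate:.1%}")
print(f"10,000 verified trajectories: ${cost_for_10k_verified:,.0f}")
```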

Comparison to Prior Work

vs. Mind2Web: AgentTrek is fully automated and multimodal (includes video/screenshots), whereas Mind2Web relies on human annotation
vs. Synatra: AgentTrek uses 'Guided Replay' on live websites to ensure grounding vs. Synatra's simulation/rewriting approach
vs. BAGEL [not cited in paper]: AgentTrek uses external web tutorials as the exploration guide rather than self-exploration or synthesis from scratch

Limitations

Dependency on the availability and quality of existing web tutorials
Dynamic websites may change, rendering static tutorials obsolete during replay
Limited task coverage (currently 11 categories) compared to the full breadth of the web
Relying on GPT-4o for evaluation may introduce bias or overlook subtle failures

Reproducibility

Code: https://agenttrek.github.io

The project page (https://agenttrek.github.io) is available, and the dataset and fine-tuning scripts are promised. A detailed cost breakdown is provided. The method relies on GPT-4o for generation and evaluation, which is a closed-source dependency.

📊 Experiments & Results

Evaluation Setup

Offline training on synthesized data, followed by evaluation on diverse web agent benchmarks

Benchmarks:

WebArena: realistic web browsing tasks (text-based evaluation)
ScreenSpot: GUI visual grounding (identifying element coordinates)
Multimodal-Mind2Web: generalist web agent tasks (Cross-Task, Cross-Website, Cross-Domain)

Metrics:

Success Rate (SR)
Element Accuracy (Ele.Acc)
Operation F1 (Op.F1)
Statistical methodology: Not explicitly reported in the paper
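
The three metrics can be sketched as follows. These are simplified versions written for illustration: element accuracy is taken as the fraction of steps whose predicted element is among the acceptable ones, and operation F1 as token-level F1 between predicted and reference operation strings. The exact benchmark implementations may differ, and all function names are assumptions.

```python
from collections import Counter
from typing import List, Set


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks (or steps) marked successful."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def element_accuracy(predicted: List[str], acceptable: List[Set[str]]) -> float:
    """Fraction of steps whose predicted element id is in the acceptable set."""
    hits = sum(p in ok for p, ok in zip(predicted, acceptable))
    return hits / len(predicted) if predicted else 0.0


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference operation string."""
    pred, ref = predicted.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(success_rate([True, False, True, True]))
print(element_accuracy(["a", "b"], [{"a"}, {"c"}]))
print(round(token_f1("TYPE new york", "TYPE new york city"), 3))
```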

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
WebArena results demonstrate that models trained on AgentTrek data significantly outperform baselines, including much larger closed-source models.
WebArena	Success Rate	13.10	22.40	+9.30
WebArena	Success Rate	3.80	10.46	+6.66
ScreenSpot results show massive improvements in visual grounding capabilities.
ScreenSpot (Web)	Average Accuracy	30.7	67.4	+36.7
Mind2Web results highlight generalization across domains and the complementary nature of synthetic data.
Multimodal-Mind2Web (Cross-Domain)	Step Success Rate	47.7	52.6	+4.9
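
The Δ column reports absolute differences in percentage points, which are not the same as relative improvements. The short check below recomputes both from the baseline and reported values in the table; the variable and function names are illustrative.

```python
# (benchmark, baseline, this paper) taken from the table rows above.
rows = [
    ("WebArena", 13.10, 22.40),
    ("WebArena", 3.80, 10.46),
    ("ScreenSpot (Web)", 30.7, 67.4),
    ("Multimodal-Mind2Web (Cross-Domain)", 47.7, 52.6),
]


def gains(baseline: float, ours: float):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(ours - baseline, 2)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


summary = [(name, *gains(b, o)) for name, b, o in rows]
for name, abs_pts, rel_pct in summary:
    print(f"{name}: +{abs_pts} points ({rel_pct}% relative)")
```

For the first WebArena row, for example, the absolute gain is 9.3 points while the relative gain over the baseline is roughly 71%.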

Experiment Figures

Scaling performance comparison between synthesized AgentTrek data and human-labeled Mind2Web data.

Main Takeaways

AgentTrek data is highly effective for both text-based and vision-based agents, bridging the gap between LLM pre-training and GUI execution.
Scaling the amount of synthetic data leads to steady improvements in cross-domain generalization (e.g., Cross-Domain Step SR improves from 39.5% with 20% data to 45.0% with 100% data).
The approach is extremely cost-effective ($0.55 per trajectory), offering a scalable alternative to human annotation.
Synthetic data complements human data: combining AgentTrek with Mind2Web yields the best performance, suggesting they cover different aspects of agent behavior.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Web Agents (actions like click, type)
Familiarity with HTML/DOM and Accessibility Trees
Basics of Vision-Language Models (VLMs)
Supervised Fine-Tuning (SFT)

Key Terms

GUI: Graphical User Interface—the visual part of a website or app that users interact with

Trajectory: A recorded sequence of an agent's interactions, including observations (screenshots), internal thoughts, and actions taken to solve a task

DOM: Document Object Model, the tree-structured representation of a webpage's elements and content that the browser exposes to scripts

AXTree: Accessibility Tree—a simplified version of the DOM used by screen readers (and agents) that focuses on interactive elements and their semantic roles

VLM: Vision-Language Model—an AI model capable of processing both images (screenshots) and text

Playwright: A software library used to automate web browsers, allowing the agent to programmatically control the browser

SFT: Supervised Fine-Tuning—training a model on a labeled dataset to improve its performance on specific tasks

FastText: A library for efficient text classification and representation learning

RedPajama: A large-scale open-source dataset of text collected from the internet, used here as the source for tutorials

Grounding: The ability of an agent to link abstract concepts (e.g., 'search button') to specific pixel coordinates or code elements on a screen
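
A common way to score grounding is to check whether a predicted click point falls inside the target element's bounding box. The sketch below illustrates that idea; the summary does not spell out the benchmark's exact scoring rule, so the box convention (left, top, right, bottom in pixels) and the function names are assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def point_in_box(point: Point, box: Box) -> bool:
    """True if the click point lies within the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(clicks: List[Point], targets: List[Box]) -> float:
    """Fraction of predicted clicks that land inside their target boxes."""
    hits = sum(point_in_box(p, b) for p, b in zip(clicks, targets))
    return hits / len(targets) if targets else 0.0


clicks = [(105.0, 42.0), (300.0, 10.0)]
targets = [(100.0, 30.0, 180.0, 60.0), (0.0, 0.0, 50.0, 50.0)]
accuracy = grounding_accuracy(clicks, targets)
print(accuracy)
```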