
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu
University of Hong Kong, Salesforce Research
International Conference on Learning Representations (2025)
Agent MM Benchmark

📝 Paper Summary

Synthetic Data Generation · GUI Agents · Web Navigation
AgentTrek synthesizes high-quality training data for web agents by harvesting text tutorials from the internet and using a VLM to execute them in real browsers, capturing grounded trajectories at low cost.
Core Problem
Training effective GUI agents requires large-scale, high-quality trajectory data (sequences of goals, observations, reasoning, and actions), which is scarce on the web.
Why it matters:
  • Standard LLM training data lacks the step-by-step visual and interactive grounding needed for complex web tasks
  • Existing datasets rely on human annotation, which is slow, unscalable, and prohibitively expensive for large datasets
  • Without diverse trajectory data, agents struggle with long-horizon planning and precise element interaction
Concrete Example: A human-annotated dataset might cost tens of dollars per task. AgentTrek instead has GPT-4o read a 'How to find return policy' tutorial and click through the live website, recording the resulting trajectory for about $0.55.
Key Novelty
Guided Replay of Web Tutorials
  • Instead of random exploration or human demonstration, the system harvests existing 'how-to' tutorials from the web to serve as high-level scripts
  • A capable VLM agent 'replays' these tutorials in a live browser, effectively converting static text instructions into dynamic, grounded interaction traces (screenshots, DOM trees, actions)
  • Self-correction and evaluation are built in: a separate VLM evaluator verifies whether the agent's replay achieved the tutorial's goal before the trajectory is added to the dataset
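The replay-then-verify loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `guided_replay`, `ToyBrowser`, `scripted_agent`, and `stopped_cleanly` are hypothetical names, the VLM agent and VLM evaluator are replaced by plain Python callables, and observations are strings standing in for screenshots or DOM snapshots.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    observation: str  # stand-in for a screenshot or DOM snapshot
    reasoning: str    # the agent's stated rationale for the action
    action: str       # e.g. "click 'Help'" or "stop"


@dataclass
class Trajectory:
    goal: str
    steps: List[Step] = field(default_factory=list)


def guided_replay(
    goal: str,
    tutorial_steps: List[str],
    agent: Callable[[str, List[str], str, List[Step]], Tuple[str, str]],
    observe: Callable[[], str],
    act: Callable[[str], None],
    evaluator: Callable[[Trajectory], bool],
    max_steps: int = 20,
) -> Optional[Trajectory]:
    """Replay a tutorial in an environment and keep the trace only if it passes evaluation."""
    traj = Trajectory(goal=goal)
    for _ in range(max_steps):
        obs = observe()
        reasoning, action = agent(goal, tutorial_steps, obs, traj.steps)
        traj.steps.append(Step(obs, reasoning, action))
        if action == "stop":
            break
        act(action)  # execute the action in the (here simulated) browser
    # A separate evaluator decides whether the trajectory enters the dataset.
    return traj if evaluator(traj) else None


class ToyBrowser:
    """Hypothetical stand-in for a live browser session."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def observe(self) -> str:
        return f"page_state_{len(self.log)}"

    def act(self, action: str) -> None:
        self.log.append(action)


def scripted_agent(goal, tutorial_steps, obs, history):
    """Stand-in for the VLM agent: follows the tutorial verbatim, then stops."""
    i = len(history)
    if i < len(tutorial_steps):
        return f"Following tutorial step {i + 1}", tutorial_steps[i]
    return "All tutorial steps are done", "stop"


def stopped_cleanly(traj: Trajectory) -> bool:
    """Stand-in for the VLM evaluator: accepts only replays that ended with 'stop'."""
    return bool(traj.steps) and traj.steps[-1].action == "stop"


browser = ToyBrowser()
tutorial = ["click 'Help'", "click 'Return policy'"]
kept = guided_replay("Find the return policy", tutorial, scripted_agent,
                     browser.observe, browser.act, stopped_cleanly)

# With too small a step budget the replay never reaches "stop" and is discarded.
short_browser = ToyBrowser()
dropped = guided_replay("Find the return policy", tutorial, scripted_agent,
                        short_browser.observe, short_browser.act,
                        stopped_cleanly, max_steps=2)
```

In this toy run `kept` holds a three-step trajectory (two tutorial actions plus the final `stop`), while `dropped` is `None` because the evaluator rejects the truncated replay. In a real system the evaluator would judge goal completion from the recorded observations rather than from the final action alone.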
Architecture
Figure 2: The complete AgentTrek pipeline from data collection to model training.
Evaluation Highlights
  • Achieves a cost of $0.551 per high-quality trajectory, significantly cheaper than human annotation
  • +9.3 point improvement in task success rate on WebArena (22.40% vs 13.10%) for Qwen2.5-32B trained on AgentTrek data, compared with GPT-4o
  • +36.7 point improvement on ScreenSpot average accuracy (67.4% vs 30.7%) for Qwen2-VL-7B fine-tuned on AgentTrek data
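Both benchmark deltas above are differences between two percentages, so they are best read as percentage points rather than relative percent changes. The short sketch below recomputes them from the reported figures and also shows the corresponding relative gains for contrast; the helper names `point_gain` and `relative_gain` are illustrative only.

```python
def point_gain(new: float, base: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(new - base, 1)


def relative_gain(new: float, base: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return round(100.0 * (new - base) / base, 1)


# Figures as reported in the highlights above.
webarena_points = point_gain(22.40, 13.10)        # 9.3 points
screenspot_points = point_gain(67.4, 30.7)        # 36.7 points

webarena_relative = relative_gain(22.40, 13.10)   # 71.0 percent
screenspot_relative = relative_gain(67.4, 30.7)   # 119.5 percent
```

The gap between the two readings is large (9.3 points corresponds to roughly a 71% relative gain), which is why the unit matters when comparing these results with other papers.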
Breakthrough Assessment
8/10
Provides a scalable, economically viable solution to the data bottleneck for GUI agents. The method effectively leverages existing web knowledge (tutorials) to generate grounded action data, yielding SOTA results on major benchmarks.