Evaluation Setup
Task: Web navigation agents execute natural language instructions on live or simulated websites.
Benchmarks:
- WebArena (execution-based web navigation)
- Mind2Web (broad-coverage web navigation, step-wise evaluation)
Metrics:
- Success Rate
- Step-wise Success Rate
- Statistical methodology: Not explicitly reported in the paper
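The two metrics above can be sketched as simple ratios. This is a minimal illustration, not the benchmarks' own evaluation code: it assumes a step counts as successful only when both the selected element and the predicted operation are correct, and that task success is a single boolean per task. The `Step` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Step:
    """One predicted agent step (hypothetical record layout)."""
    element_correct: bool    # the right page element was selected
    operation_correct: bool  # the right operation (click, type, ...) was predicted


def step_success_rate(steps: Sequence[Step]) -> float:
    """Fraction of steps where both element and operation are correct."""
    if not steps:
        return 0.0
    hits = sum(s.element_correct and s.operation_correct for s in steps)
    return hits / len(steps)


def task_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks judged successful, e.g. by an execution-based check."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


# Illustrative data only; these are not results from the paper.
example_steps = [
    Step(True, True),
    Step(True, False),
    Step(False, True),
    Step(True, True),
]
print(step_success_rate(example_steps))
print(task_success_rate([True, False, False, True]))
```

Returning 0.0 for an empty input is a convenience choice in this sketch; an evaluation harness might instead raise an error.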
Main Takeaways
- AWM substantially outperforms baselines on both WebArena (+51.1% relative) and Mind2Web (+24.6% relative), demonstrating the value of abstracted workflow memory.
- Online AWM generalizes effectively to cross-task, cross-website, and cross-domain settings, improving over baselines by 8.9 to 14.0 absolute points as distribution gaps widen.
- The method exhibits a 'snowball effect' in online settings, where learning simple tasks (e.g., finding a place) enables the solution of complex tasks (e.g., getting a zip code) later in the stream.
- AWM outperforms even methods augmented with human-written workflows (+7.9%), suggesting that model-induced workflows can be more effective or scalable than manual curation.
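The takeaways above mix relative improvements (percent of the baseline score) with absolute improvements (percentage points). The sketch below shows how the two differ. The function names are hypothetical and the scores are illustrative values, not figures from the paper.

```python
def absolute_gain(baseline: float, method: float) -> float:
    """Difference in percentage points between two success rates."""
    return method - baseline


def relative_gain(baseline: float, method: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (method - baseline) / baseline * 100.0


# Illustrative scores only (assumed, not taken from the paper).
baseline_sr = 20.0  # baseline success rate, in percent
method_sr = 30.0    # improved success rate, in percent

print(absolute_gain(baseline_sr, method_sr))  # 10.0 points
print(relative_gain(baseline_sr, method_sr))  # 50.0 percent
```

The same 10-point gain corresponds to a larger relative gain when the baseline is low, which is why it helps to state which kind of improvement a reported number refers to.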