
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems

Tamer Abuelsaad, Deepak Akkil, Prasenjit Dey, Ashish Jagmohan, Aditya Vempaty, Ravi Kokku
IBM Research
arXiv (2024)
Agent Benchmark

Paper Summary

Web Agents, Autonomous Web Navigation, Hierarchical Agent Architecture
Agent-E utilizes a hierarchical architecture separating high-level planning from low-level browser execution, combined with flexible DOM distillation and change-observation feedback, to achieve state-of-the-art web navigation performance.
Core Problem
Web agents struggle with expansive/noisy DOMs that exceed context windows, complex UI patterns designed for humans, and the need for robust multi-step planning and error recovery.
Why it matters:
  • Current agents have low success rates on complex real-world tasks compared to humans, limiting mainstream adoption
  • Existing single-agent architectures often get overwhelmed by page details or fail to recover from minor execution errors
  • Cost and latency (task completion time) are critical but under-reported metrics for practical agent deployment
Concrete Example: When booking a flight, a standard agent might not notice that a date selector reset to its default value because the input was in an invalid format. Without 'change observation,' it proceeds blindly and books the wrong date. Agent-E detects the state change (or the lack of one) and corrects itself.
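The date-selector case above can be sketched in a few lines. This is a minimal illustration, not Agent-E's implementation: `DateField`, `DEFAULT_DATE`, and `enter_date_with_observation` are hypothetical names, and the toy widget simply falls back to its default when the input is not in YYYY-MM-DD form. The point is that the action reports what actually changed rather than a bare success flag.

```python
from dataclasses import dataclass

DEFAULT_DATE = "2024-01-01"  # hypothetical default the toy widget falls back to


def _is_iso_date(text: str) -> bool:
    """Loose YYYY-MM-DD shape check (digits only, no calendar validation)."""
    parts = text.split("-")
    return [len(p) for p in parts] == [4, 2, 2] and all(p.isdigit() for p in parts)


@dataclass
class DateField:
    """Toy stand-in for a date picker that silently rejects bad input."""
    value: str = DEFAULT_DATE

    def type_text(self, text: str) -> None:
        # Invalid formats are not reported; the field just resets to its default.
        self.value = text if _is_iso_date(text) else DEFAULT_DATE


def enter_date_with_observation(field: DateField, text: str) -> str:
    """Perform the action, then describe the resulting state change."""
    before = field.value
    field.type_text(text)
    after = field.value
    if after == text:
        return f"Field now shows '{after}'."
    if after == before:
        return f"No change: field still shows '{after}'; input '{text}' was not accepted."
    return f"Field changed to '{after}', not the requested '{text}'."
```

An agent reading the "No change" observation has a concrete signal to retry with a different format instead of continuing with the wrong date.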
Key Novelty
Hierarchical Planner-Navigator Architecture with Flexible Sensing
  • Separates concerns: A 'Planner' handles high-level task decomposition and verification, while a 'Browser Navigation Agent' handles low-level DOM interaction and execution
  • Flexible DOM Distillation: The agent autonomously selects the best DOM representation (text-only, input fields, or all fields) based on the specific sub-task requirements
  • Change Observation: Actions return not just success/failure, but a description of the resulting state change (e.g., 'popup appeared'), acting as a feedback mechanism similar to Reflexion
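As a rough illustration of flexible DOM distillation, the sketch below produces three views of the same page using only Python's standard `html.parser`. The mode names (`text_only`, `input_fields`, `all_fields`), the tag sets, and the `distill` function are assumptions made for this example; the summary does not specify how Agent-E builds each representation, and in the real system the agent itself picks the mode per sub-task.

```python
from html.parser import HTMLParser

# Assumed tag sets for each view; the actual system may differ.
INPUT_TAGS = {"input", "select", "textarea", "button"}
ALL_TAGS = INPUT_TAGS | {"a"}
KEPT_ATTRS = ("id", "name", "type", "href")


class _Distiller(HTMLParser):
    def __init__(self, mode: str):
        super().__init__()
        self.mode = mode
        self.out = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not visible

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
            return
        if self.mode == "text_only":
            return
        keep = INPUT_TAGS if self.mode == "input_fields" else ALL_TAGS
        if tag in keep:
            a = dict(attrs)
            self.out.append({"tag": tag, **{k: a[k] for k in KEPT_ATTRS if k in a}})

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self.mode == "text_only" and not self._skip and data.strip():
            self.out.append(data.strip())


def distill(html: str, mode: str) -> list:
    """Return a reduced view of the page for the given mode."""
    if mode not in ("text_only", "input_fields", "all_fields"):
        raise ValueError(f"unknown mode: {mode}")
    parser = _Distiller(mode)
    parser.feed(html)
    parser.close()
    return parser.out
```

A summarization sub-task would call `distill(page, "text_only")`, while filling a search form only needs `distill(page, "input_fields")`, which keeps the prompt far smaller than the full DOM.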
Architecture
Figure 2: High-level architecture showing the separation between the Planner Agent and Browser Navigation Agent, and their respective skill executors.
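The planner/navigator split can be sketched as two small classes. This is a simplified illustration under stated assumptions: the `Planner` and `Navigator` classes, the `"skill:argument"` step format, and the retry policy are hypothetical, and the plan is a fixed list where the real system would have an LLM produce and revise it. Skills return an `(ok, observation)` pair so the planner can react to what happened.

```python
from typing import Callable, Dict, List, Tuple

Skill = Callable[[str], Tuple[bool, str]]  # returns (ok, observation)


class Navigator:
    """Low-level executor: maps a step like 'click:search' to a skill call."""

    def __init__(self, skills: Dict[str, Skill]):
        self.skills = skills

    def execute(self, step: str) -> Tuple[bool, str]:
        name, _, arg = step.partition(":")
        skill = self.skills.get(name)
        if skill is None:
            return False, f"unknown skill '{name}'"
        return skill(arg)


class Planner:
    """High-level loop: hands steps to the navigator and checks each result."""

    def __init__(self, navigator: Navigator, max_retries: int = 1):
        self.navigator = navigator
        self.max_retries = max_retries

    def run(self, plan: List[str]) -> Tuple[bool, List[str]]:
        log: List[str] = []
        for step in plan:
            for _ in range(self.max_retries + 1):
                ok, observation = self.navigator.execute(step)
                log.append(f"{step} -> {observation}")
                if ok:
                    break
            else:
                # All attempts failed: report failure instead of claiming success.
                log.append(f"giving up at '{step}'")
                return False, log
        return True, log
```

Keeping the planner ignorant of DOM details means it only sees short observations, while the navigator never needs the full task description.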
Evaluation Highlights
  • Achieves a 73.2% success rate on the WebVoyager benchmark, outperforming the previous text-only SOTA (Wilbur) by 20.5% and the multi-modal SOTA by 16.0%
  • Demonstrates high self-awareness, correctly identifying technical failures in over 52% of failed tasks rather than hallucinating success
  • Achieves a 95.7% success rate on the WolframAlpha tasks, a 30.5% improvement over the previous multi-modal SOTA
Breakthrough Assessment
8/10
Significant jump in success rates on a rigorous benchmark (WebVoyager) through architectural improvements (hierarchy + sensing) rather than just better base models. Establishes new comprehensive metrics like error-awareness.