
Enhancing Web Agents with a Hierarchical Memory Tree

Yunteng Tan, Zhi Gao, Xinxiao Wu
Beijing Institute of Technology
arXiv (2026)
Memory, Agent, Reasoning

📝 Paper Summary

Memory organization, Web agents
The Hierarchical Memory Tree (HMT) organizes web agent memory into a hierarchy of intents, stages, and actions. This decouples transferable planning logic from site-specific execution details and reduces failures on unseen websites.
Core Problem
Retrieval-based web agents typically use flat memory structures that entangle high-level task logic with site-specific action details (like element IDs), causing failures when transferred to new websites.
Why it matters:
  • Current agents fail to generalize across websites because they try to execute actions grounded in the specific HTML structure of previous sites
  • Flat memory retrieval leads to 'intention-execution entanglement,' where correct high-level intents are paired with invalid low-level execution parameters
  • Workflow mismatch occurs when agents retrieve actions that are functionally correct for the task but sequentially invalid for the current page state
Concrete Example: An agent retrieving a memory to 'click search' might attempt to click a button with ID '#btn-123' from a previous site. On a new site where that ID doesn't exist, the agent fails, even though the intent to 'click search' is correct.
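The failure in this example can be illustrated with a minimal sketch. It is not the paper's code: the `Element` type, the two grounding helpers, and both sample pages are hypothetical, and matching on role plus visible label is only one assumed way to ground a semantic description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    dom_id: str  # site-specific identifier
    role: str    # e.g. "button"
    label: str   # visible text


def ground_by_id(page: List[Element], dom_id: str) -> Optional[Element]:
    """Flat-memory style: replay the stored DOM ID verbatim."""
    return next((e for e in page if e.dom_id == dom_id), None)


def ground_by_description(page: List[Element], role: str, label: str) -> Optional[Element]:
    """Semantic style: match on role and visible label, ignoring the ID."""
    wanted = label.lower()
    return next(
        (e for e in page if e.role == role and wanted in e.label.lower()),
        None,
    )


# Hypothetical pages: the same "click search" intent, different markup.
old_site = [Element("#btn-123", "button", "Search")]
new_site = [Element("#q-submit", "button", "Search products")]

stored_id = "#btn-123"
found_on_old = ground_by_id(old_site, stored_id)    # succeeds on the original site
found_on_new = ground_by_id(new_site, stored_id)    # None: the ID does not exist here
found_by_desc = ground_by_description(new_site, "button", "search")
```

In this sketch the stored ID resolves only on the site it was recorded on, while the description-based lookup still finds the search button on the new page.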
Key Novelty
Hierarchical Memory Tree (HMT)
  • Deconstructs interaction trajectories into three levels: Intent (user goals), Stage (semantic subgoals with pre/post-conditions), and Action (abstract patterns without raw IDs)
  • Replaces site-specific element identifiers (e.g., DOM IDs) with 'semantic element descriptions' (e.g., 'button labeled Search') to allow grounding on new page layouts
  • Uses a 'Planner-Actor' inference scheme where the Planner verifies visual pre-conditions to match the logical stage, and the Actor grounds abstract descriptions to local elements
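The three-level structure and the Planner-Actor split described above could be sketched roughly as follows. This is an illustrative assumption rather than the authors' implementation. All class and function names are hypothetical. Page state is simplified to a set of string facts instead of visual observations, and skipping stages whose post-conditions already hold is an added assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Action:
    op: str            # abstract operation, e.g. "click" or "type"
    target_role: str   # semantic element description: role ...
    target_label: str  # ... and visible label, with no raw DOM ID


@dataclass
class Stage:
    name: str
    preconditions: Set[str]
    postconditions: Set[str]
    actions: List[Action] = field(default_factory=list)


@dataclass
class Intent:
    goal: str
    stages: List[Stage] = field(default_factory=list)


def plan_stage(intent: Intent, page_facts: Set[str]) -> Optional[Stage]:
    """Planner: pick the first stage whose pre-conditions hold on the page.

    Assumption: a stage whose post-conditions already hold counts as done.
    """
    for stage in intent.stages:
        done = bool(stage.postconditions) and stage.postconditions <= page_facts
        if stage.preconditions <= page_facts and not done:
            return stage
    return None


def ground_action(action: Action, page: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Actor: map an abstract description to a DOM ID on the current page."""
    wanted = action.target_label.lower()
    for dom_id, element in page.items():
        if element["role"] == action.target_role and wanted in element["label"].lower():
            return dom_id
    return None


# Hypothetical memory entry and an unseen page.
memory = Intent(
    goal="find a product",
    stages=[
        Stage("search", {"search box visible"}, {"results list visible"},
              [Action("type", "textbox", "search"), Action("click", "button", "search")]),
        Stage("open result", {"results list visible"}, {"product page visible"},
              [Action("click", "link", "result")]),
    ],
)
new_page = {
    "#q": {"role": "textbox", "label": "Search products"},
    "#q-submit": {"role": "button", "label": "Search"},
}

stage = plan_stage(memory, {"search box visible"})
grounded = [(a.op, ground_action(a, new_page)) for a in stage.actions]
```

Because the stored actions carry only roles and labels, the same memory entry can be re-grounded on any page that exposes a matching element. The pre-condition check keeps the Planner from selecting a stage that does not fit the current page state.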
Evaluation Highlights
  • +9.4% improvement in Task Success Rate on the Mind2Web Cross-Website split compared to AWM (online), indicating strong generalization to unseen sites
  • +6.6% improvement in Total Success Rate on WebArena compared to a Flat Retrieval baseline, consistent with reduced intention-execution entanglement
  • Outperforms the state-of-the-art AWM agent by 3.2% on WebArena, with the largest gains in the 'Maps' (+10.4%) and 'GitLab' (+5.8%) domains
Breakthrough Assessment
8/10
Addresses a critical bottleneck in web agent generalization (dependence on site-specific element IDs) with a logically sound hierarchical abstraction. Sizable empirical gains on Mind2Web and WebArena support the case for decoupling planning logic from execution.