Enhancing Web Agents with a Hierarchical Memory Tree

📝 Paper Summary

Memory organization, Web agents

HMT structures web agent memory into a hierarchy of intents, stages, and actions to decouple transferable planning logic from site-specific execution details, reducing failures on unseen websites.

Core Problem

Retrieval-based web agents typically use flat memory structures that entangle high-level task logic with site-specific action details (like element IDs), causing failures when transferred to new websites.

Why it matters:

Current agents fail to generalize across websites because they try to execute actions grounded in the specific HTML structure of previous sites
Flat memory retrieval leads to 'intention-execution entanglement,' where correct high-level intents are paired with invalid low-level execution parameters
Workflow mismatch occurs when agents retrieve actions that are functionally correct for the task but sequentially invalid for the current page state

Concrete Example: An agent retrieving a memory to 'click search' might attempt to click a button with ID '#btn-123' from a previous site. On a new site where that ID doesn't exist, the agent fails, even though the intent to 'click search' is correct.

Key Novelty

Hierarchical Memory Tree (HMT)

Deconstructs interaction trajectories into three levels: Intent (user goals), Stage (semantic subgoals with pre/post-conditions), and Action (abstract patterns without raw IDs)
Replaces site-specific element identifiers (e.g., DOM IDs) with 'semantic element descriptions' (e.g., 'button labeled Search') to allow grounding on new page layouts
Uses a 'Planner-Actor' inference scheme where the Planner verifies visual pre-conditions to match the logical stage, and the Actor grounds abstract descriptions to local elements
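
The three levels above can be pictured as nested records. The following is a minimal sketch, not the paper's implementation: the class names, field names, and the example entry are all assumptions chosen for illustration. The point it shows is that the leaf level stores a semantic element description rather than a raw DOM ID.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Action:
    """Leaf level: an abstract action pattern (hypothetical fields)."""
    operation: str            # e.g. "click" or "type"
    element_description: str  # semantic description, never a raw DOM ID
    argument: str = ""        # optional value, e.g. text to type


@dataclass
class Stage:
    """Middle level: a semantic subgoal guarded by observable conditions."""
    subgoal: str
    preconditions: list[str] = field(default_factory=list)
    postconditions: list[str] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)


@dataclass
class Intent:
    """Root level: a standardized user goal."""
    goal: str
    stages: list[Stage] = field(default_factory=list)


# Illustrative memory entry (invented for this sketch).
search_intent = Intent(
    goal="find a product by keyword",
    stages=[
        Stage(
            subgoal="submit search query",
            preconditions=["search box visible"],
            postconditions=["search results visible"],
            actions=[
                Action("type", "text input labeled Search", "{keyword}"),
                Action("click", "button labeled Search"),
            ],
        )
    ],
)


def iter_actions(intent: Intent):
    """Walk the tree from intent to leaves, yielding (subgoal, action)."""
    for stage in intent.stages:
        for action in stage.actions:
            yield stage.subgoal, action


for subgoal, action in iter_actions(search_intent):
    print(f"{subgoal}: {action.operation} -> {action.element_description}")
```

Because no leaf holds a site-specific identifier, the same entry can be reused on a page with a different DOM, provided an element matching the description exists there.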

Architecture

Overview of the HMT framework, showing the offline memory construction pipeline (left) and the online stage-aware inference process (right).

Evaluation Highlights

+9.4 percentage points in Task Success Rate on the Mind2Web Cross-Website split compared to AWM (online), showing strong generalization to unseen sites
+6.6 percentage points in Total Success Rate on WebArena compared to a Flat Retrieval baseline, mitigating intention-execution entanglement
Outperforms the state-of-the-art AWM agent by 3.2 percentage points on WebArena, with the largest gains in the 'Maps' (+10.4) and 'GitLab' (+5.8) domains

Breakthrough Assessment

8/10

Addresses a critical bottleneck in web agent generalization (ID dependency) with a logically sound hierarchical abstraction. Significant empirical gains on major benchmarks confirm the validity of decoupling logic from execution.

⚙️ Technical Details

Problem Definition

Setting: Web navigation tasks where an agent interacts with a browser over discrete time steps to fulfill a natural language instruction

Inputs: Natural language instruction q, current observation o_t (DOM/accessibility tree), interaction history h_{t-1}

Outputs: Grounded action a_t (operation, target element, arguments)

Pipeline Flow

Instruction Normalization (Raw Query → Standardized Intent)
Task Retrieval (Intent → Candidate Tasks)
Subgoal Retrieval (History/Obs → Candidate Subgoals)
Planner (Selects Stage based on Conditions)
Actor (Grounds Actions using Semantic Descriptions)
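
The five steps above can be sketched end to end as follows. This is a toy illustration under stated assumptions: every function is a hypothetical stand-in (the paper uses LLMs and an embedding model for these steps), matching is done with simple string overlap, and the memory and observation formats are invented for the example.

```python
def normalize_instruction(raw_query):
    """Stand-in for the LLM normalizer: lowercase and collapse whitespace."""
    return " ".join(raw_query.lower().split())


def retrieve_tasks(intent, memory):
    """Return stored tasks that share at least one word with the intent."""
    words = set(intent.split())
    return [task for task in memory if words & set(task["intent"].split())]


def retrieve_subgoals(tasks):
    """Collect the stages of all candidate tasks."""
    return [stage for task in tasks for stage in task["stages"]]


def plan(stages, observation):
    """Pick the first stage whose pre-conditions all hold on the page."""
    for stage in stages:
        if all(cond in observation["conditions"] for cond in stage["pre"]):
            return stage
    return None


def act(stage, observation):
    """Ground each abstract action to a page element via its label."""
    grounded = []
    for operation, description in stage["actions"]:
        for element in observation["elements"]:
            if element["label"].lower() in description.lower():
                grounded.append((operation, element["id"]))
                break
    return grounded


# Hypothetical memory and page state.
memory = [{
    "intent": "search for a product",
    "stages": [{
        "name": "submit query",
        "pre": ["search box visible"],
        "actions": [("click", "button labeled Search")],
    }],
}]
observation = {
    "conditions": {"search box visible"},
    "elements": [{"id": "el-7", "label": "Search"}],
}

intent = normalize_instruction("  Search for a   Laptop ")
stage = plan(retrieve_subgoals(retrieve_tasks(intent, memory)), observation)
grounded = act(stage, observation) if stage else []
print(intent, "->", stage["name"] if stage else None, "->", grounded)
```

The element ID in the final action comes from the current observation, not from memory, which is the separation the pipeline is designed to enforce.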

System Modules

Instruction Normalizer

Maps raw user instructions to standardized intents and constraints to stabilize retrieval

Model or implementation: Large Language Model (specific variant not named in text)

Stage-Aware Retriever

Retrieves relevant subgoals by combining semantic similarity with condition matching

Model or implementation: Embedding Model + Logic Checker
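
One way to combine the two signals is a weighted sum. The sketch below assumes a score of the form lam * similarity + (1 - lam) * condition_match; the summary does not give the paper's exact formula, so this form, the function names, and the toy embeddings are all assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def condition_match(preconditions, observed):
    """Fraction of a stage's pre-conditions that hold in the current state."""
    if not preconditions:
        return 1.0
    return sum(cond in observed for cond in preconditions) / len(preconditions)


def stage_score(query_vec, stage, observed, lam=0.5):
    """Assumed scoring rule: convex combination of the two signals."""
    sim = cosine_similarity(query_vec, stage["embedding"])
    cond = condition_match(stage["pre"], observed)
    return lam * sim + (1 - lam) * cond


# Toy candidates with made-up 2-d embeddings.
stages = [
    {"name": "open search", "embedding": [1.0, 0.0],
     "pre": ["home page visible"]},
    {"name": "filter results", "embedding": [0.8, 0.6],
     "pre": ["search results visible"]},
]
observed = {"search results visible"}
query_vec = [1.0, 0.0]

ranked = sorted(stages, key=lambda s: stage_score(query_vec, s, observed),
                reverse=True)
print([s["name"] for s in ranked])
```

In this toy case the stage with the closest embedding ("open search") ranks second because its pre-condition does not hold on the current page, which is the workflow mismatch that condition matching is meant to filter out.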

Planner

Identifies the correct logical stage by verifying observable pre-conditions against the current state

Model or implementation: Large Language Model (specific variant not named in text)

Actor

Generates concrete actions by matching abstract semantic descriptions to the current page's elements

Model or implementation: Large Language Model (specific variant not named in text)
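
To make the grounding step concrete, the sketch below replaces the LLM Actor with a simple token-overlap matcher. The matcher, the element format, and the IDs are assumptions for illustration only. It replays the earlier 'click search' example: the stored ID '#btn-123' does not exist on the new page, but the semantic description still resolves to a local element.

```python
def ground(description, elements):
    """Return the element whose role and label best overlap the description.

    Stand-in for the LLM Actor: scores candidates by shared lowercase tokens.
    """
    desc_tokens = set(description.lower().split())
    best, best_overlap = None, 0
    for element in elements:
        tokens = set(f"{element['role']} {element['label']}".lower().split())
        overlap = len(desc_tokens & tokens)
        if overlap > best_overlap:
            best, best_overlap = element, overlap
    return best


# Hypothetical elements on an unseen site.
new_site = [
    {"id": "#nav-home", "role": "link", "label": "Home"},
    {"id": "#q", "role": "textbox", "label": "Search"},
    {"id": "#submit-q", "role": "button", "label": "Search"},
]

# Replaying a raw ID from the old site fails: the ID is absent here.
old_id_present = any(el["id"] == "#btn-123" for el in new_site)

# The semantic description grounds to the local search button instead.
target = ground("button labeled Search", new_site)
print(old_id_present, target["id"])
```

A real Actor would need to handle ties, synonyms, and missing matches; returning None when nothing overlaps leaves that decision to the caller.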

Novel Architectural Elements

Three-level memory hierarchy (Intent-Stage-Action) explicitly designed to strip site-specific details at the leaf level
Storage of 'semantic element descriptions' instead of raw execution traces to enable dynamic re-grounding
Pre-condition/Post-condition verification mechanism within the retrieval loop to prevent temporal workflow mismatch

Modeling

Base Model: Large Language Model (specific variant not named in text)

Training Method: In-context learning / Retrieval-Augmented Generation

Key Hyperparameters:

lambda: Hyperparameter balancing semantic similarity and condition matching (value not explicitly detailed)
margin_threshold_delta: Used for confidence-aware fallback
confidence_threshold_tau: Used for confidence-aware fallback
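
The summary names the two fallback thresholds but not the rule that uses them. One plausible reading, sketched below purely as an assumption, is that a retrieved stage is trusted only when its score reaches tau and it beats the runner-up by at least delta; otherwise the agent falls back to acting without that memory. The function name and default values are hypothetical.

```python
def select_with_fallback(scores, tau=0.6, delta=0.1):
    """Return the top-scoring stage name, or None to signal fallback.

    Assumed rule (not confirmed by the source): accept the top candidate
    only if score >= tau and (top - runner_up) >= delta.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_name, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    if top_score >= tau and top_score - runner_up >= delta:
        return top_name
    return None


print(select_with_fallback({"filter results": 0.9, "open search": 0.5}))
print(select_with_fallback({"filter results": 0.9, "open search": 0.85}))
print(select_with_fallback({"filter results": 0.4}))
```

The first call accepts the top stage, the second rejects it because the margin is too small, and the third rejects it because the score is below tau.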

Compute: Not reported in the paper

Comparison to Prior Work

vs. AWM: HMT uses a hierarchical tree with semantic element descriptions to decouple logic from IDs, whereas AWM stores flat workflows that may retain brittle details
vs. Reflexion [28]: HMT stores successful procedural abstractions, whereas Reflexion stores verbal reinforcement of errors
vs. Flat Retrieval (Baseline): HMT explicitly separates intent, stage, and action to prevent context pollution from irrelevant low-level details
vs. MemTree [24]: HMT focuses on web trajectory abstraction with pre/post-conditions, whereas MemTree organizes dialogue history (discussed as related work, not evaluated as a direct baseline)

Limitations

Relies on the capabilities of the underlying LLM to correctly abstract semantic descriptions and identify conditions
Processing raw trajectories into the HMT format adds construction-pipeline overhead
No statistical significance tests explicitly reported for the margins of improvement

Reproducibility

No code release is provided. The paper relies on an automated, LLM-driven pipeline for memory construction, but neither the prompt templates nor the backbone LLM used in the experiments (e.g., GPT-4, Llama-3) are named in the provided text. The benchmarks (Mind2Web, WebArena) are public.

📊 Experiments & Results

Evaluation Setup

Web navigation agents performing tasks based on natural language instructions using external memory

Benchmarks:

Mind2Web: general web navigation (offline memory construction)
WebArena: realistic web environment (online memory accumulation)

Metrics:

Step Success Rate (Step SR)
Task Success Rate (Task SR)
Element Accuracy (EA)
Statistical methodology: Not explicitly reported in the paper
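
As a rough illustration of how the step-level and task-level metrics differ, the sketch below computes both from per-step outcomes. It assumes a task counts as successful only if every one of its steps succeeds, and it pools steps across tasks; the paper's exact aggregation is not stated in this summary, and the outcome data is invented.

```python
def step_success_rate(tasks):
    """Percentage of individual steps that succeeded, pooled over all tasks."""
    steps = [ok for task in tasks for ok in task]
    return 100.0 * sum(steps) / len(steps) if steps else 0.0


def task_success_rate(tasks):
    """Percentage of tasks in which every step succeeded (assumed rule)."""
    if not tasks:
        return 0.0
    solved = sum(bool(task) and all(task) for task in tasks)
    return 100.0 * solved / len(tasks)


# Made-up outcomes: one inner list of per-step results per task.
outcomes = [
    [True, True, True],
    [True, False],
    [False, True, True],
]
print(f"Step SR: {step_success_rate(outcomes):.1f}")
print(f"Task SR: {task_success_rate(outcomes):.1f}")
```

Under this assumption, a single wrong step fails the whole task, so Task SR is never higher than Step SR on the same outcomes.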

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Mind2Web results demonstrate HMT's superior generalization in cross-website and cross-domain settings compared to flat memory baselines.
Mind2Web (Cross-Website)	Task SR	45.1	54.5	+9.4
Mind2Web (Cross-Domain)	Task SR	46.3	48.8	+2.5
Mind2Web (Cross-Task)	Task SR	56.4	63.5	+7.1
WebArena results validate the online memory accumulation capability of HMT in a realistic, dynamic environment.
WebArena	Total Success Rate	32.1	38.7	+6.6
WebArena	Maps Domain SR	31.8	42.2	+10.4

Experiment Figures

Conceptual illustration of 'intention-execution entanglement' in flat memory vs. the decoupled approach in HMT.

Main Takeaways

HMT consistently outperforms flat-memory methods (AWM, Flat Retrieval) across both offline and online settings.
The performance gap is largest in Cross-Website scenarios (Mind2Web), validating the hypothesis that hierarchical abstraction enables better transfer to unseen environments.
The decoupling of action patterns from element IDs allows the agent to maintain high success rates even when the underlying DOM structure changes completely.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Web Agents (DOM, accessibility trees, element grounding)
Familiarity with RAG (Retrieval-Augmented Generation)
Basic knowledge of hierarchical planning (Planner/Actor models)

Key Terms

HMT: Hierarchical Memory Tree—the proposed memory structure organizing trajectories into Intent, Stage, and Action levels

Intention-execution entanglement: The failure mode where transferable high-level task logic is inextricably linked to non-transferable, site-specific action details in memory

Semantic element description: A transferable description of a UI element (e.g., role, label, position) used in memory instead of raw IDs to enable cross-site grounding

AWM: Agent Workflow Memory—a baseline method that induces reusable workflows from interaction traces

Step SR: Step Success Rate—the percentage of individual steps correctly predicted/executed

Task SR: Task Success Rate—the percentage of full tasks successfully completed

DOM: Document Object Model—the structural representation of a webpage

Pre-conditions: Observable states (e.g., 'search results visible') that must exist before a specific memory stage can be retrieved

SFT: Supervised Fine-Tuning