Hybrid Self-evolving Structured Memory for GUI Agents

📝 Paper Summary

Memory for GUI Agents
Vision-Language Model (VLM) Agents

HyMEM equips GUI agents with a brain-inspired memory graph that couples high-level symbolic strategies with fine-grained trajectory embeddings, enabling continuous self-evolution and dynamic context refreshing during long tasks.

Core Problem

Current GUI agents fail at long-horizon tasks because their memories are either too abstract (losing visual details) or unstructured (flat retrieval), preventing them from organizing strategies or adapting to phase changes.

Why it matters:

Real-world computer tasks involve long workflows where error accumulation leads to failure
Existing flat retrieval systems (RAG) cannot organize knowledge hierarchically or update it efficiently, leading to stale or redundant context
Neuroscience suggests effective memory requires both hippocampal (detailed episodic) and neocortical (generalized semantic) components, which current agents lack

Concrete Example: In a long shopping task, an agent might successfully 'search' but fail at 'checkout' because it retrieves irrelevant search-related memories. HyMEM detects this phase shift and refreshes the working memory to discard search details and retrieve checkout strategies.

Key Novelty

Hybrid Self-evolving Structured Memory (HyMEM)

Constructs a memory graph where nodes represent interactions at three levels: discrete strategies (text), semantic attributes (tags), and continuous trajectory embeddings (visual vectors)
Uses a 'VLM Judge' to enforce self-evolution, deciding whether to Add, Merge, or Replace nodes based on information gain rather than just appending new data
Implements on-the-fly working memory updates by detecting execution phase shifts (e.g., from search to payment) and re-retrieving relevant context mid-task
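
The self-evolution step can be illustrated with a minimal sketch. The summary does not say how the VLM Judge measures information gain, so the sketch below swaps in a word-overlap heuristic as a stand-in; the names `StrategyNode`, `overlap`, `judge`, and `evolve`, and the 0.3 threshold, are all assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyNode:
    strategy: str                            # high-level strategy text
    tags: set = field(default_factory=set)   # semantic attributes


def overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (stand-in for a VLM judgment)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def judge(new: StrategyNode, existing: StrategyNode) -> str:
    """Hypothetical judge: returns 'add', 'merge', or 'replace'."""
    sim = overlap(new.strategy, existing.strategy)
    if sim < 0.3:
        return "add"        # mostly new information
    if len(new.strategy.split()) > len(existing.strategy.split()):
        return "replace"    # same topic, but the new strategy is more detailed
    return "merge"          # same topic, no extra detail: only tags are merged


def evolve(memory: list, new: StrategyNode) -> str:
    """Apply one judged update to the memory list and return the operation."""
    if not memory:
        memory.append(new)
        return "add"
    # Compare against the most similar stored node only.
    idx = max(range(len(memory)),
              key=lambda i: overlap(new.strategy, memory[i].strategy))
    op = judge(new, memory[idx])
    if op == "add":
        memory.append(new)
    elif op == "merge":
        memory[idx].tags |= new.tags
    else:  # replace, carrying over the old node's tags
        new.tags |= memory[idx].tags
        memory[idx] = new
    return op
```

The point of the design is that memory size grows only on `add`; `merge` and `replace` update an existing node in place, which is how the judged update avoids simply appending every new trajectory.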

Evaluation Highlights

Boosts Qwen2.5-VL-7B success rate by 22.5 percentage points (from 12.5% to 35.0%) on GUI benchmarks
Outperforms proprietary models using a 7B backbone: exceeds Gemini2.5-Pro-Vision by 5.4 points and GPT-4o by 15.3 points
Enables open-source 7B/8B models (like UI-TARS-1.5-7B and Qwen3-VL-8B) to match or exceed strong closed-source systems

Breakthrough Assessment

9/10

Proposes a brain-inspired memory architecture that addresses key structural and evolution problems in agent memory, yielding a large performance jump (+22.5 percentage points) and allowing small models to beat GPT-4o.

⚙️ Technical Details

Problem Definition

Setting: GUI Agent interaction where the agent must plan and execute actions a_t based on visual observations o_t, language instructions q, and retrieved memory

Inputs: User instruction q, current screenshot o_t, and interaction history

Outputs: Action a_t (e.g., click, type) and memory update operations

Pipeline Flow

Observation Encoding (CLIP)
Global Retrieval (Graph Traversal)
Working Memory Construction (Hybrid Context)
Action Prediction (VLM + SOM)
Dynamic Refresh (Phase Shift Detection)
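
The control flow of these five stages can be sketched as a loop in which retrieval is re-run only when the detected phase changes. This is a toy illustration under stated assumptions: the function names (`run_episode`, `toy_phase`, `toy_retrieve`, `toy_predict`) are hypothetical, screenshots are replaced by strings, observation encoding is folded into the retrieval callable, and the phase check runs before each prediction as a simplification.

```python
from typing import Callable, List, Optional


def run_episode(
    screens: List[str],
    retrieve: Callable[[str], List[str]],
    detect_phase: Callable[[str], str],
    predict: Callable[[str, List[str]], str],
) -> List[str]:
    """Run the pipeline over a sequence of observations."""
    actions: List[str] = []
    phase: Optional[str] = None
    working_memory: List[str] = []
    for screen in screens:
        new_phase = detect_phase(screen)
        if new_phase != phase:
            # Dynamic refresh: a phase shift re-triggers retrieval and
            # rebuilds the working memory (also true on the first step).
            working_memory = retrieve(screen)
            phase = new_phase
        # Action prediction from the current screen plus working memory.
        actions.append(predict(screen, working_memory))
    return actions


# Toy stand-ins for the real components (all hypothetical).
STRATEGIES = {
    "search": ["use the search bar"],
    "checkout": ["open cart, then pay"],
}


def toy_phase(screen: str) -> str:
    return "checkout" if "cart" in screen else "search"


def toy_retrieve(screen: str) -> List[str]:
    return STRATEGIES[toy_phase(screen)]


def toy_predict(screen: str, working_memory: List[str]) -> str:
    return f"{working_memory[0]} @ {screen}"
```

Calling `run_episode(["home page", "results page", "cart page"], toy_retrieve, toy_phase, toy_predict)` retrieves once for the two search screens and once more when the cart screen appears.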

System Modules

Hybrid Retrieval

Identify relevant memory nodes using semantic search and graph connectivity

Model or implementation: CLIP + FAISS
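
A minimal sketch of retrieval that combines similarity search with graph connectivity is shown below. It is not the paper's implementation: hand-written 2-D vectors stand in for CLIP embeddings, a brute-force cosine ranking stands in for a FAISS index, and the one-hop neighbor expansion is an assumed reading of "graph connectivity". The names `cosine` and `hybrid_retrieve` and the node labels are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_retrieve(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    edges: Dict[str, List[str]],
    k: int = 1,
) -> List[str]:
    """Top-k nodes by similarity, followed by their one-hop graph neighbors."""
    ranked = sorted(embeddings,
                    key=lambda name: cosine(query, embeddings[name]),
                    reverse=True)
    seeds = ranked[:k]
    result = list(seeds)
    for node in seeds:
        for neighbor in edges.get(node, []):
            if neighbor not in result:
                result.append(neighbor)
    return result


# Toy memory graph (illustrative values only).
EMBEDDINGS = {
    "search_query": [1.0, 0.0],
    "apply_filter": [0.9, 0.1],
    "open_cart": [0.0, 1.0],
    "enter_payment": [0.3, 0.7],
}
EDGES = {"search_query": ["apply_filter"], "open_cart": ["enter_payment"]}
```

With a query vector close to `open_cart`, for example `hybrid_retrieve([0.1, 0.9], EMBEDDINGS, EDGES)`, the seed node is found by similarity and `enter_payment` is pulled in through the edge rather than through its own score.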

Working Memory Constructor (Context Management)

Synthesize retrieved nodes into actionable context

Model or implementation: Qwen2.5-VL-7B (VLM)

Action Predictor

Predict the next GUI action based on current screen and working memory

Model or implementation: Qwen2.5-VL-7B + UI-INS-7B (fallback)

Phase Detector (Context Management)

Detect if the task phase has shifted to trigger memory refresh

Model or implementation: VLM

Novel Architectural Elements

Hybrid Node Structure: Graph nodes contain both discrete text (Strategy/Attributes) and continuous tensors (Trajectory Embeddings)
Dynamic Refresh Loop: An explicit feedback loop during inference that re-triggers retrieval based on visual phase shifts, distinct from standard static RAG
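
One way to picture the hybrid node is as a record that holds the discrete fields and the continuous embedding side by side, with context construction splitting them into a text part and a vector part. The field names, the `neighbors` index list, and the `build_context` helper are assumptions for illustration; the summary does not specify the actual data layout.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class HybridNode:
    strategy: str                 # discrete: high-level strategy text
    attributes: Set[str]          # discrete: semantic tags
    trajectory: List[float]       # continuous: trajectory embedding
    neighbors: List[int] = field(default_factory=list)  # graph edges by index


def build_context(nodes: List[HybridNode]) -> Tuple[str, List[List[float]]]:
    """Split nodes into a text block and a parallel list of embeddings."""
    lines = [
        f"- {n.strategy} [{', '.join(sorted(n.attributes))}]" for n in nodes
    ]
    vectors = [n.trajectory for n in nodes]
    return "\n".join(lines), vectors
```

For example, `build_context([HybridNode("open cart, then pay", {"checkout", "shopping"}, [0.1, 0.9])])` yields one text line for the prompt and one embedding kept alongside it.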

Modeling

Base Model: Qwen2.5-VL-7B

Training Method: Inference-time architecture (No gradient updates to backbone reported in pipeline)

Adaptation: None (Frozen backbone utilized)

Trainable Parameters: None (Memory is updated, not model weights)

Training Data:

Memory constructed from 2,883 successful trajectories
Sources: GUIAct, Mind2Web Training Set, Agent Rollouts

Compute: Lightweight VLM (7B) used for all memory encoding and judging tasks

Comparison to Prior Work

vs. AppAgent: HyMEM adds long-term structured memory to prevent error accumulation
vs. CoMEM: HyMEM adds a discrete pathway (graph of strategies) for better high-level planning, rather than just using continuous embeddings
vs. ExpeL: HyMEM incorporates continuous visual embeddings into the graph nodes, whereas ExpeL is primarily text-based
vs. MemGPT [not cited in paper]: HyMEM focuses on multimodal GUI trajectories and graph evolution, while MemGPT focuses on text-context management via OS-like paging

Limitations

Reliance on the capability of the VLM Judge: if the judge misclassifies a strategy as 'new', memory redundancy increases
Computational cost of continuous re-retrieval during dynamic refresh phases
Dependence on accurate Set-of-Mark (SOM) detection for action grounding
Graph complexity may grow with very large diverse datasets, requiring pruning strategies (though Add/Merge/Replace helps)

Reproducibility

Code availability is mentioned as 'CodeWebsite' but no URL is provided in the text. The method uses open models (Qwen2.5-VL, CLIP) and standard libraries (FAISS). Memory construction relies on specific VLM prompts for the 'Judge' which are described conceptually.

📊 Experiments & Results

Evaluation Setup

GUI Agent evaluation on complex computer-use tasks

Benchmarks:

GUI Tasks Benchmark (long-horizon computer use; GUIAct/Mind2Web implied)

Metrics:

Success Rate
Relative Improvement (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

HyMEM significantly improves open-source models, allowing them to compete with state-of-the-art closed-source models.
Benchmark	Metric	Baseline	This Paper	Δ
GUI Tasks Benchmark	Success Rate (%), vs. Qwen2.5-VL-7B without HyMEM	12.5	35.0	+22.5
GUI Tasks Benchmark	Success Rate (%), vs. Gemini2.5-Pro-Vision	29.6	35.0	+5.4
GUI Tasks Benchmark	Success Rate (%), vs. GPT-4o	19.7	35.0	+15.3
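
The Δ column is an absolute difference in percentage points, not a relative improvement. The short check below recomputes it from the table values and shows the relative gain for comparison. The mapping of the 29.6 and 19.7 baselines to Gemini2.5-Pro-Vision and GPT-4o is inferred from the gaps quoted in the Evaluation Highlights, and the helper names are hypothetical.

```python
def absolute_gain(baseline: float, ours: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(ours - baseline, 1)


def relative_gain(baseline: float, ours: float) -> float:
    """Improvement as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * (ours - baseline) / baseline, 1)


HYMEM_SUCCESS = 35.0
BASELINES = {
    "Qwen2.5-VL-7B without HyMEM": 12.5,
    "Gemini2.5-Pro-Vision": 29.6,   # inferred: 35.0 - 5.4
    "GPT-4o": 19.7,                 # inferred: 35.0 - 15.3
}

for name, base in BASELINES.items():
    print(name, absolute_gain(base, HYMEM_SUCCESS),
          relative_gain(base, HYMEM_SUCCESS))
```

For the Qwen2.5-VL-7B row, the absolute gain is 22.5 points while the relative gain is 180%, which is why the two should not be reported interchangeably.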

Main Takeaways

Consistent improvements observed across different backbones (Qwen2.5-VL, UI-TARS-1.5, Qwen3-VL), showing the memory module is model-agnostic
Small models (7B/8B) equipped with HyMEM can match or beat much larger closed-source models (GPT-4o, Gemini) on GUI tasks
The 'self-evolving' mechanism (Add/Merge/Replace) is crucial for maintaining memory quality without uncontrolled growth

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Retrieval-Augmented Generation (RAG)
Graph Data Structures
CLIP embeddings

Key Terms

GUI: Graphical User Interface—the visual computer screen the agent interacts with

VLM: Vision-Language Model—AI models that can process both images (screenshots) and text

SOM: Set-of-Mark—a prompting technique that overlays numbered tags on UI elements to help models refer to specific coordinates

CoMEM: A continuous memory method used as a baseline/component, compressing trajectories into dense embeddings

CLIP: Contrastive Language-Image Pre-training—a model used here to embed text and images into a shared space for similarity search

FAISS: Facebook AI Similarity Search—a library for efficient similarity search of dense vectors

ReAct: Reasoning + Acting—a prompting paradigm where agents generate a thought trace before taking an action

RAG: Retrieval-Augmented Generation—enhancing model outputs by retrieving relevant data from an external source