
Hybrid Self-evolving Structured Memory for GUI Agents

Sibo Zhu, Wenyi Wu, Kun Zhou, Stephen Wang, Biwei Huang
University of California, San Diego, Abel.ai
arXiv (2026)
Memory Agent MM

📝 Paper Summary

Memory for GUI Agents, Vision-Language Model (VLM) Agents
HyMEM equips GUI agents with a brain-inspired memory graph that couples high-level symbolic strategies with fine-grained trajectory embeddings, enabling continuous self-evolution and dynamic context refreshing during long tasks.
Core Problem
Current GUI agents fail at long-horizon tasks because their memories are either too abstract (losing visual details) or unstructured (flat retrieval), preventing them from organizing strategies or adapting to phase changes.
Why it matters:
  • Real-world computer tasks involve long workflows where error accumulation leads to failure
  • Existing flat retrieval systems (RAG) cannot organize knowledge hierarchically or update it efficiently, leading to stale or redundant context
  • Neuroscience suggests effective memory requires both hippocampal (detailed episodic) and neocortical (generalized semantic) components, which current agents lack
Concrete Example: In a long shopping task, an agent might successfully 'search' but fail at 'checkout' because it retrieves irrelevant search-related memories. HyMEM detects this phase shift and refreshes the working memory to discard search details and retrieve checkout strategies.
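The shopping example can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the keyword-based `detect_phase`, the `PHASE_KEYWORDS` map, the `LONG_TERM_MEMORY` dictionary, and the `WorkingMemory` class are all hypothetical stand-ins, since this summary does not describe how HyMEM actually detects phase shifts.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical keyword map standing in for a real phase detector.
PHASE_KEYWORDS = {
    "search": ("search", "query", "results"),
    "checkout": ("cart", "checkout", "payment"),
}

# Hypothetical long-term memory: phase label -> stored strategy notes.
LONG_TERM_MEMORY = {
    "search": ["Type the product name into the search bar, then filter."],
    "checkout": ["Open the cart, confirm quantity, then choose a payment method."],
}


def detect_phase(observation: str) -> Optional[str]:
    """Return the first phase whose keywords appear in the observation."""
    text = observation.lower()
    for phase, keywords in PHASE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return phase
    return None


@dataclass
class WorkingMemory:
    phase: Optional[str] = None
    context: List[str] = field(default_factory=list)

    def step(self, observation: str) -> bool:
        """Process one observation; return True if the context was refreshed."""
        new_phase = detect_phase(observation)
        if new_phase is None or new_phase == self.phase:
            return False  # same phase (or unknown): keep the current context
        # Phase shift: discard the old context and re-retrieve for the new phase.
        self.phase = new_phase
        self.context = list(LONG_TERM_MEMORY.get(new_phase, []))
        return True
```

Calling `step` with a search-page observation loads the search strategies; a later observation mentioning the cart or payment replaces them with the checkout strategies rather than appending to them.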
Key Novelty
Hybrid Self-evolving Structured Memory (HyMEM)
  • Constructs a memory graph where nodes represent interactions at three levels: discrete strategies (text), semantic attributes (tags), and continuous trajectory embeddings (visual vectors)
  • Uses a 'VLM Judge' to enforce self-evolution, deciding whether to Add, Merge, or Replace nodes based on information gain rather than just appending new data
  • Implements on-the-fly working memory updates by detecting execution phase shifts (e.g., from search to payment) and re-retrieving relevant context mid-task
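The three-level node and the Add/Merge/Replace decision can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: in the paper the decision is made by a VLM Judge based on information gain, whereas here a cosine-similarity threshold and a hypothetical `score` field stand in for that judgment. The names `MemoryNode`, `judge`, `update`, and `sim_threshold` are invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryNode:
    strategy: str       # discrete strategy (text)
    tags: set           # semantic attributes (tags)
    embedding: list     # continuous trajectory embedding (vector)
    score: float = 0.0  # hypothetical quality signal, not from the paper


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def judge(new, nodes, sim_threshold=0.9):
    """Stand-in for the VLM Judge: return (action, index of matched node)."""
    best_i, best_sim = None, -1.0
    for i, node in enumerate(nodes):
        sim = cosine(new.embedding, node.embedding)
        if sim > best_sim:
            best_i, best_sim = i, sim
    if best_i is None or best_sim < sim_threshold:
        return "add", None          # nothing similar stored: new knowledge
    if new.score > nodes[best_i].score:
        return "replace", best_i    # similar but better: supersede the old node
    return "merge", best_i          # similar, not better: fold in its tags


def update(nodes, new):
    """Apply the judge's decision to the node list and return the action."""
    action, i = judge(new, nodes)
    if action == "add":
        nodes.append(new)
    elif action == "replace":
        nodes[i] = new
    else:
        nodes[i].tags |= new.tags
    return action
```

The point of the design is that the store does not grow with every trajectory: near-duplicates either enrich an existing node or supersede it, and only dissimilar experience adds a node.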
Evaluation Highlights
  • Raises Qwen2.5-VL-7B performance by 22.5 percentage points (from 12.5% to 35.0%) on GUI benchmarks
  • Outperforms proprietary models: exceeds Gemini2.5-Pro-Vision by 5.4% and GPT-4o by 15.3% using a 7B backbone
  • Enables open-source 7B/8B models (like UI-TARS-1.5-7B and Qwen3-VL-8B) to match or exceed strong closed-source systems
Breakthrough Assessment
9/10
Proposes a brain-inspired memory architecture that addresses the structure and evolution problems in agent memory, yielding a large gain for Qwen2.5-VL-7B (12.5% to 35.0%, +22.5 percentage points) and allowing a 7B model to beat GPT-4o.