Evaluation Setup
GUI Agent evaluation on complex computer-use tasks
Benchmarks:
- GUI Tasks Benchmark: long-horizon computer use (benchmark not named explicitly; GUIAct/Mind2Web implied)
Metrics:
- Success Rate
- Relative Improvement (%)
- Statistical methodology: Not explicitly reported in the paper
Key Results
HyMEM significantly improves open-source models, allowing them to compete with state-of-the-art closed-source models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GUI Tasks Benchmark | Success Rate | 12.5 | 35.0 | +22.5 |
| GUI Tasks Benchmark | Relative Performance vs SOTA | 29.6 | 35.0 | +5.4 |
| GUI Tasks Benchmark | Relative Performance vs SOTA | 19.7 | 35.0 | +15.3 |
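The Δ column in the table above is an absolute difference in points (This Paper minus Baseline). The summary also lists "Relative Improvement (%)" as a metric without defining it. The sketch below assumes the common definition, (new - baseline) / baseline × 100, which may differ from the paper's own. The function names are illustrative.

```python
def absolute_gain(baseline: float, ours: float) -> float:
    """Absolute difference in points, rounded to one decimal like the table."""
    return round(ours - baseline, 1)


def relative_improvement(baseline: float, ours: float) -> float:
    """Relative improvement in percent (assumed definition, not from the paper)."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (ours - baseline) / baseline * 100.0


# (metric, baseline, this paper) rows copied from the Key Results table.
rows = [
    ("Success Rate", 12.5, 35.0),
    ("Relative Performance vs SOTA", 29.6, 35.0),
    ("Relative Performance vs SOTA", 19.7, 35.0),
]

for metric, baseline, ours in rows:
    gain = absolute_gain(baseline, ours)
    rel = relative_improvement(baseline, ours)
    print(f"{metric}: +{gain} points, {rel:.1f}% relative")
```

Under this assumed definition, the first row's gain of 22.5 points over a 12.5 baseline corresponds to a 180% relative improvement, which is why the two figures should not be read interchangeably.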
Main Takeaways
- Consistent improvements observed across different backbones (Qwen2.5-VL, UI-TARS-1.5, Qwen3-VL), showing the memory module is model-agnostic
- Small models (7B/8B) equipped with HyMEM can match or beat much larger closed-source models (GPT-4o, Gemini) on GUI tasks
- The 'self-evolving' mechanism (Add/Merge/Replace) is crucial for maintaining memory quality without uncontrolled growth
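To make the Add/Merge/Replace idea in the last takeaway concrete, here is a minimal sketch of a capacity-bounded memory. It is not HyMEM's implementation: the class `BoundedMemory`, the `uses` counter, the Jaccard token similarity, and the threshold and capacity values are all assumptions chosen for illustration. A real system would more likely use learned embeddings and a model-driven update decision.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    uses: int = 1  # hypothetical usefulness counter, not from the paper


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for an embedding-based similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


class BoundedMemory:
    """Toy memory whose size never exceeds `capacity` (illustrative only)."""

    def __init__(self, capacity: int = 3, merge_threshold: float = 0.5):
        self.capacity = capacity
        self.merge_threshold = merge_threshold
        self.entries: list[Entry] = []

    def update(self, text: str) -> str:
        """Apply one of Add / Merge / Replace and return the operation name."""
        best = max(self.entries, key=lambda e: jaccard(e.text, text), default=None)
        if best is not None and jaccard(best.text, text) >= self.merge_threshold:
            # Merge: fold the new experience into its closest existing entry.
            best.uses += 1
            if len(text) > len(best.text):
                best.text = text  # keep the more detailed wording
            return "merge"
        if len(self.entries) < self.capacity:
            # Add: there is room and nothing similar is stored yet.
            self.entries.append(Entry(text))
            return "add"
        # Replace: evict the least-used entry so the store stays bounded.
        victim = min(self.entries, key=lambda e: e.uses)
        self.entries.remove(victim)
        self.entries.append(Entry(text))
        return "replace"
```

A short usage example: with `capacity=2`, storing "click the save button", then "click the save button now", then "open settings menu", then "scroll down the page" triggers add, merge, add, and replace in that order, and the store still holds two entries at the end.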