Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
Alibaba Group
Neural Information Processing Systems
(2024)
MM, Agent, Memory
📝 Paper Summary
Mobile UI Agents, Multi-Modal Large Language Models (MLLMs), Multi-Agent Collaboration
Mobile-Agent-v2 employs a multi-agent architecture (Planning, Decision, Reflection) with a memory unit to solve navigation challenges and context length limitations in mobile device operations.
Core Problem
Single-agent MLLMs struggle with mobile device operations due to overly long, interleaved text-image history sequences and the difficulty of retaining focus content across multi-step tasks.
Why it matters:
Long history sequences degrade MLLM performance, making it hard to track task progress
Important information (focus content) from previous screens is often lost in long contexts, preventing successful completion of dependent sub-tasks
Existing single-agent architectures lack robust error correction mechanisms when operations fail or hallucinate
Concrete Example: In a task requiring writing sports news, an agent must first query match results. In single-agent setups, the lengthy history of searching for results obscures the actual scores when the agent finally attempts to write the news, causing it to fail or hallucinate the content.
Key Novelty
Multi-Agent Collaboration with Specialized Roles (Planning, Decision, Reflection)
Decomposes the operation process into three agents: a Planner that summarizes history into text, a Decider that executes actions and updates memory, and a Reflector that verifies outcomes.
Introduces a dedicated Memory Unit to store 'focus content' (task-relevant info like a weather forecast or match score) separately from the raw operation history, preventing information loss.
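The split between task progress and focus content can be pictured as a small text-only state object. The following is a minimal sketch, not the paper's implementation: the class name `AgentState`, its fields, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentState:
    """Hypothetical container for the text-only state carried between steps."""

    instruction: str
    task_progress: str = ""  # condensed summary, written by the planning agent
    focus_content: List[str] = field(default_factory=list)  # the memory unit

    def remember(self, item: str) -> None:
        # Store task-relevant information once, independent of the raw history.
        if item and item not in self.focus_content:
            self.focus_content.append(item)

    def build_prompt(self) -> str:
        # The prompt holds no past screenshots, only condensed text.
        memory = "; ".join(self.focus_content) or "(empty)"
        return (
            f"Instruction: {self.instruction}\n"
            f"Progress: {self.task_progress or '(none yet)'}\n"
            f"Memory: {memory}"
        )


state = AgentState("Write a news item about today's match")
state.task_progress = "Searched for the match result."
state.remember("Final score: 2-1")
state.remember("Final score: 2-1")  # duplicate is ignored
prompt = state.build_prompt()
```

Because the score lives in `focus_content` rather than in a long screenshot history, it is still available verbatim when the writing step begins.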
Architecture
The iterative workflow of Mobile-Agent-v2 showing the interaction between the three agents (Planning, Decision, Reflection) and the Memory Unit.
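The iterative workflow can be sketched as a loop over three callables. This is an illustrative toy under stated assumptions: `ToyPhone`, the verdict strings (`"correct"`, `"wrong_page"`, `"no_effect"`), the `STOP` and `BACK` actions, and the scripted stub agents are all hypothetical; in a real system each agent would be an MLLM call. The choice to leave failed operations out of the progress summary is likewise an assumption of this sketch.

```python
class ToyPhone:
    """Stand-in for a device: pages are strings, 'open X' navigates, 'BACK' returns."""

    def __init__(self):
        self.page, self.stack = "home", []

    def screen(self):
        return self.page

    def execute(self, action):
        if action == "BACK" and self.stack:
            self.page = self.stack.pop()
        elif action.startswith("open "):
            self.stack.append(self.page)
            self.page = action[len("open "):]


def run_episode(instruction, phone, plan, decide, reflect, max_steps=10):
    progress, memory, last_action = "", [], ""
    for _ in range(max_steps):
        before = phone.screen()
        progress = plan(progress, last_action)           # planning agent
        action, focus = decide(instruction, progress, memory, before)  # decision agent
        if action == "STOP":
            break
        if focus:
            memory.append(focus)                         # update the memory unit
        phone.execute(action)
        verdict = reflect(action, before, phone.screen())  # reflection agent
        if verdict == "wrong_page":
            phone.execute("BACK")                        # backtrack after an error
        # Only verified operations feed the next progress summary.
        last_action = action if verdict == "correct" else ""
    return progress, memory


def plan(progress, last_action):
    return f"{progress} Done: {last_action}.".strip() if last_action else progress


def make_decider():
    tried = []

    def decide(instruction, progress, memory, screen):
        if screen == "chat_alice":
            return "STOP", ""
        # Scripted mistake: the first attempt opens the wrong chat.
        action = "open chat_bob" if not tried else "open chat_alice"
        tried.append(action)
        return action, ""

    return decide


def reflect(action, before, after):
    if after == before:
        return "no_effect"
    return "correct" if after == "chat_alice" else "wrong_page"


phone = ToyPhone()
progress, memory = run_episode("Message Alice", phone, plan, make_decider(), reflect)
```

In this toy run the first operation opens the wrong chat, the reflector flags it, the loop backs out, and the second attempt succeeds. The mistaken operation never appears in `progress`.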
Evaluation Highlights
+30% improvement in task completion rate compared to the single-agent Mobile-Agent architecture
Achieves >90% success rate on basic instruction following tasks (Mobile-Eval)
Significantly reduces effective context length by condensing image-text history into pure-text task progress summaries
Breakthrough Assessment
8/10
Significant architectural advance by applying multi-agent patterns to mobile UI automation. Effectively solves the context-length bottleneck that plagues single-agent visual approaches.
⚙️ Technical Details
Problem Definition
Setting: Automated execution of multi-step user instructions on a mobile operating system using visual perception
Inputs: User instruction (natural language) and the current mobile screen screenshot
Outputs: Discrete mobile operations (Tap, Swipe, Type, etc.) until task completion
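The discrete output space can be modeled as a small set of typed operations parsed from the model's text output. This is a sketch only: the textual syntax (`Tap(x, y)`, `Swipe(x1, y1, x2, y2)`, `Type("...")`, `Stop`) and the class names are assumptions made for illustration, and `Stop` stands in for whatever signal marks task completion.

```python
import re
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Tap:
    x: int
    y: int


@dataclass(frozen=True)
class Swipe:
    start: Tuple[int, int]
    end: Tuple[int, int]


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Stop:
    pass


Operation = Union[Tap, Swipe, TypeText, Stop]


def parse_operation(raw: str) -> Operation:
    """Parse one operation string in the hypothetical syntax described above."""
    raw = raw.strip()
    if raw == "Stop":
        return Stop()
    m = re.fullmatch(r"Tap\((\d+),\s*(\d+)\)", raw)
    if m:
        return Tap(int(m.group(1)), int(m.group(2)))
    m = re.fullmatch(r"Swipe\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)", raw)
    if m:
        x1, y1, x2, y2 = (int(g) for g in m.groups())
        return Swipe((x1, y1), (x2, y2))
    m = re.fullmatch(r'Type\("(.*)"\)', raw, flags=re.DOTALL)
    if m:
        return TypeText(m.group(1))
    # Rejecting unknown output keeps malformed model text away from the device.
    raise ValueError(f"Unrecognized operation: {raw!r}")
```

For example, `parse_operation("Tap(120, 640)")` yields `Tap(x=120, y=640)`, while any string outside the grammar raises `ValueError` instead of being executed.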
Pipeline Flow
Visual Perception Module (processes screen)
Planning Agent (summarizes history into text progress)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
Mobile-Eval | Task Completion | Not explicitly reported as a single aggregate number in text (derived from relative claim) | Not explicitly reported as a single aggregate number in text | -
Experiment Figures
A comparison between Single-Agent and Multi-Agent navigation on a sports news writing task.
Main Takeaways
Mobile-Agent-v2 significantly outperforms single-agent baselines, particularly in long-horizon tasks requiring memory of previous steps.
The Planning Agent successfully condenses history, preventing the context overflow that causes failures in single-agent architectures.
The Reflection Agent effectively catches erroneous operations (e.g., wrong chat opened), allowing the system to backtrack and self-correct.
Manual knowledge injection (providing usage manuals) further enhances performance.
📚 Prerequisite Knowledge
Prerequisites
Understanding of Multi-Modal Large Language Models (MLLMs) like GPT-4V
Basic knowledge of mobile UI interactions (XML-based UI hierarchies vs. screenshot-based visual perception)
Familiarity with agentic workflows (Planning, Action, Reflection)
Key Terms
MLLM: Multi-Modal Large Language Model—an AI model capable of processing and generating both text and image data
Focus Content: Specific task-relevant information extracted from history screens (e.g., a phone number or match score) needed for subsequent operations
Task Progress: A pure-text summary generated by the Planning Agent describing completed sub-tasks, replacing the raw history of images and actions
Visual Perception Module: A component that converts raw screenshots into structured text and icon coordinates using OCR and icon detection tools
Hallucination: A phenomenon where the model generates incorrect or non-existent information/actions not supported by the input data
UI: User Interface—the visual elements on a screen that a user interacts with
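To make the Visual Perception Module entry above concrete, the sketch below merges OCR results and icon detections into a single structured-text description with tap coordinates. It assumes the OCR and icon-detection tools have already run; the input shape (label plus bounding box), the function names, and the output line format are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Element:
    kind: str   # "text" (from OCR) or "icon" (from icon detection)
    label: str
    box: Box


def center(box: Box) -> Tuple[int, int]:
    left, top, right, bottom = box
    return ((left + right) // 2, (top + bottom) // 2)


def describe_screen(
    ocr: List[Tuple[str, Box]], icons: List[Tuple[str, Box]]
) -> str:
    """Merge both detector outputs into text, in top-to-bottom, left-to-right order."""
    elements = [Element("text", label, box) for label, box in ocr]
    elements += [Element("icon", label, box) for label, box in icons]
    elements.sort(key=lambda e: (center(e.box)[1], center(e.box)[0]))
    return "\n".join(
        f"{e.kind} '{e.label}' at {center(e.box)}" for e in elements
    )


description = describe_screen(
    ocr=[("Settings", (40, 100, 200, 140))],
    icons=[("gear", (220, 100, 260, 140)), ("back arrow", (10, 20, 50, 60))],
)
```

Each line pairs a label with the center point of its bounding box, which is the kind of coordinate a Tap operation would need.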