Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

📝 Paper Summary

Mobile UI Agents, Multi-Modal Large Language Models (MLLMs), Multi-Agent Collaboration

Mobile-Agent-v2 employs a multi-agent architecture (Planning, Decision, Reflection) with a memory unit to solve navigation challenges and context length limitations in mobile device operations.

Core Problem

Single-agent MLLMs struggle with mobile device operations due to overly long, interleaved text-image history sequences and the difficulty of retaining focus content across multi-step tasks.

Why it matters:

Long history sequences degrade MLLM performance, making it hard to track task progress
Important information (focus content) from previous screens is often lost in long contexts, preventing successful completion of dependent sub-tasks
Existing single-agent architectures lack robust error-correction mechanisms for cases where an operation fails or the model hallucinates

Concrete Example: In a task requiring writing sports news, an agent must first query match results. In single-agent setups, the lengthy history of searching for results obscures the actual scores when the agent finally attempts to write the news, causing it to fail or hallucinate the content.

Key Novelty

Multi-Agent Collaboration with Specialized Roles (Planning, Decision, Reflection)

Decomposes the operation process into three agents: a Planner that summarizes history into text, a Decider that executes actions and updates memory, and a Reflector that verifies outcomes.
Introduces a dedicated Memory Unit to store 'focus content' (task-relevant info like a weather forecast or match score) separately from the raw operation history, preventing information loss.

Architecture

The iterative workflow of Mobile-Agent-v2 showing the interaction between the three agents (Planning, Decision, Reflection) and the Memory Unit.

Evaluation Highlights

+30% improvement in task completion rate compared to the single-agent Mobile-Agent architecture
Achieves >90% success rate on basic instruction following tasks (Mobile-Eval)
Significantly reduces effective context length by condensing image-text history into pure-text task progress summaries

Breakthrough Assessment

8/10

A significant architectural advance that applies multi-agent patterns to mobile UI automation. It addresses the context-length bottleneck that affects single-agent visual approaches by replacing the interleaved image-text history with a pure-text progress summary.

⚙️ Technical Details

Problem Definition

Setting: Automated execution of multi-step user instructions on a mobile operating system using visual perception

Inputs: User instruction (natural language) and the current mobile screen screenshot

Outputs: Discrete mobile operations (Tap, Swipe, Type, etc.) until task completion
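
The output space above can be pictured as a small set of typed operations. The following is a minimal sketch, not the paper's implementation: the class names, field names, and pixel-coordinate convention are hypothetical, and only the three operations named above are modeled (the "etc." is left unfilled).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Tap:
    x: int  # hypothetical: pixel coordinates on the screenshot
    y: int


@dataclass(frozen=True)
class Swipe:
    x1: int  # hypothetical: start point
    y1: int
    x2: int  # hypothetical: end point
    y2: int


@dataclass(frozen=True)
class TypeText:
    text: str  # text to enter into the focused input field


# The set of operations an agent step may emit in this sketch.
Operation = Union[Tap, Swipe, TypeText]


def describe(op: Operation) -> str:
    """Render an operation as a one-line log entry."""
    if isinstance(op, Tap):
        return f"Tap({op.x}, {op.y})"
    if isinstance(op, Swipe):
        return f"Swipe(({op.x1}, {op.y1}) -> ({op.x2}, {op.y2}))"
    if isinstance(op, TypeText):
        return f"Type({op.text!r})"
    raise TypeError(f"unsupported operation: {op!r}")
```

Keeping operations as immutable values makes them easy to log and to compare when checking individual steps.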

Pipeline Flow

Visual Perception Module (processes screen)
Planning Agent (summarizes history into text progress)
Decision Agent (observes screen/memory, generates action, updates memory)
Reflection Agent (verifies action outcome)
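
The four stages above can be sketched as one loop. This is an illustrative outline under stated assumptions, not the authors' code: every agent is passed in as a plain callable standing in for a model call, screens and actions are plain strings, and the rule that a rejected action is not reported back to the planner is an assumption of this sketch.

```python
from typing import Callable, List, Optional, Tuple

# (action, focus content to remember); either element may be None.
Decision = Tuple[Optional[str], Optional[str]]


def run_episode(
    instruction: str,
    perceive: Callable[[], str],
    plan: Callable[[str, str, Optional[str]], str],
    decide: Callable[[str, str, str, List[str]], Decision],
    reflect: Callable[[str, str, str], bool],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> Tuple[List[str], List[str]]:
    """Run the perceive -> plan -> decide -> reflect loop until done."""
    progress = ""                 # pure-text task progress kept by the planner
    memory: List[str] = []        # focus content written by the decider
    last_action: Optional[str] = None
    accepted: List[str] = []      # actions the reflector judged as correct

    for _ in range(max_steps):
        before = perceive()
        # The planner sees only text: old progress plus the last accepted action.
        progress = plan(instruction, progress, last_action)
        action, note = decide(instruction, progress, before, memory)
        if note is not None:
            memory.append(note)
        if action is None:        # in this sketch, None means "task finished"
            break
        execute(action)
        after = perceive()
        if reflect(before, after, action):
            last_action = action
            accepted.append(action)
        else:
            # Assumption: a rejected action is not added to the task progress.
            last_action = None
    return accepted, memory
```

In this arrangement the decider never receives the raw screenshot history, only the current screen, the text progress, and the memory contents.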

System Modules

Visual Perception Module

Enhance screen recognition by detecting text and icons

Model or implementation: OCR tool + Icon Detection tool + Icon Description tool
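
As a rough illustration of how the three tool outputs might be combined into one textual screen description, consider the sketch below. The data shapes (text with a bounding box, icon boxes with separate descriptions), the function names, and the output format are all assumptions made for this example; no real OCR or detection library is called.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # hypothetical: (left, top, right, bottom) in pixels


@dataclass
class ScreenElement:
    kind: str                 # "text" or "icon"
    label: str                # recognized text or icon description
    center: Tuple[int, int]   # point an operation could target


def center_of(box: Box) -> Tuple[int, int]:
    left, top, right, bottom = box
    return ((left + right) // 2, (top + bottom) // 2)


def build_perception(
    ocr_results: Sequence[Tuple[str, Box]],
    icon_boxes: Sequence[Box],
    icon_descriptions: Sequence[str],
) -> List[ScreenElement]:
    """Merge OCR text and described icons into one ordered element list."""
    elements = [ScreenElement("text", text, center_of(box)) for text, box in ocr_results]
    elements += [
        ScreenElement("icon", desc, center_of(box))
        for box, desc in zip(icon_boxes, icon_descriptions)
    ]
    # Order top-to-bottom, then left-to-right, so the listing follows the layout.
    elements.sort(key=lambda e: (e.center[1], e.center[0]))
    return elements


def to_prompt(elements: Sequence[ScreenElement]) -> str:
    """Render the elements as plain text for inclusion in an agent prompt."""
    return "\n".join(f"{e.kind} '{e.label}' at {e.center}" for e in elements)
```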

Planning Agent

Condense lengthy history into a concise textual summary of task progress

Model or implementation: LLM (text-only)

Decision Agent

Generate operations and update the memory unit with focus content

Model or implementation: MLLM (GPT-4V)

Reflection Agent

Observe screens before/after operation to detect errors or ineffective actions

Model or implementation: MLLM (GPT-4V)
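
The reflection step can be thought of as a three-way classification of each operation. The sketch below is only a stand-in for the idea: in the described system an MLLM compares the two screenshots, whereas here screens are strings, an unchanged screen is treated as an ineffective action, and the `matches_intent` callable is a hypothetical placeholder for the model's judgement.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    CORRECT = "correct"          # screen changed as the operation intended
    ERRONEOUS = "erroneous"      # screen changed, but to the wrong place
    INEFFECTIVE = "ineffective"  # screen did not change at all


def reflect(
    before: str,
    after: str,
    matches_intent: Callable[[str, str], bool],
) -> Outcome:
    """Classify an operation from the screens observed before and after it."""
    if before == after:
        return Outcome.INEFFECTIVE
    if matches_intent(before, after):
        return Outcome.CORRECT
    return Outcome.ERRONEOUS
```

A caller could retry after an INEFFECTIVE result and return to the previous screen after an ERRONEOUS one; both policies are assumptions of this example rather than details taken from the summary.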

Novel Architectural Elements

Three-agent collaborative loop (Planning, Decision, Reflection) replacing single-agent loop
Decoupling of history tracking: Planning Agent handles history summarization (text-only) to relieve the Decision Agent's context window
Explicit Memory Unit for 'focus content' that is read/written by the Decision Agent
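
A memory unit of this kind could be as simple as a keyed text store that is rendered into the Decision Agent's prompt. The following is a minimal sketch with hypothetical class and method names; the summary does not specify how the memory is structured.

```python
from typing import Dict, Optional


class MemoryUnit:
    """Keeps focus content separate from the raw operation history."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def write(self, key: str, content: str) -> None:
        """Store or overwrite one piece of focus content."""
        self._items[key] = content

    def read(self, key: str) -> Optional[str]:
        """Return the stored content, or None if nothing was saved under key."""
        return self._items.get(key)

    def render(self) -> str:
        """Format all focus content as text for inclusion in a prompt."""
        if not self._items:
            return "(empty)"
        return "\n".join(f"{key}: {value}" for key, value in self._items.items())
```

In the sports news example, the match score would be written once when it appears on screen and read back later when the news text is typed, regardless of how many steps lie in between.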

Modeling

Base Model: GPT-4V (for Decision and Reflection agents)

Reproducibility

Code: https://github.com/X-PLUG/MobileAgent

📊 Experiments & Results

Evaluation Setup

Dynamic evaluation across various operating systems and applications using the Mobile-Eval benchmark

Benchmarks:

Mobile-Eval (Mobile device operation instructions)

Metrics:

Task Completion Rate (Success Rate)
Process Score (Accuracy of individual steps)
Statistical methodology: Not explicitly reported in the paper
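
One plausible way to compute the two metrics from per-task logs is sketched below. The definitions are assumptions based on the metric names above (fraction of tasks completed, and fraction of individual steps judged correct); the paper's exact formulas may differ.

```python
from typing import List


def success_rate(task_completed: List[bool]) -> float:
    """Fraction of tasks that were completed."""
    if not task_completed:
        raise ValueError("at least one task is required")
    return sum(task_completed) / len(task_completed)


def step_accuracy(step_correct_per_task: List[List[bool]]) -> float:
    """Fraction of individual steps judged correct, pooled over all tasks."""
    steps = [ok for task in step_correct_per_task for ok in task]
    if not steps:
        raise ValueError("at least one step is required")
    return sum(steps) / len(steps)
```

For example, `success_rate([True, False, True, True])` gives 0.75, and `step_accuracy([[True, True], [True, False]])` also gives 0.75.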

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Mobile-Eval	Task Completion Rate	Not reported as a single aggregate number (only a relative claim is given)	Not reported as a single aggregate number	-

Experiment Figures

A comparison between Single-Agent and Multi-Agent navigation on a sports news writing task.

Main Takeaways

Mobile-Agent-v2 significantly outperforms single-agent baselines, particularly in long-horizon tasks requiring memory of previous steps.
The Planning Agent condenses the history into text, preventing the context overflow that causes failures in single-agent architectures.
The Reflection Agent effectively catches erroneous operations (e.g., wrong chat opened), allowing the system to backtrack and self-correct.
Manual knowledge injection (providing usage manuals) further enhances performance.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multi-Modal Large Language Models (MLLMs) like GPT-4V
Basic knowledge of mobile UI interactions (XML vs. Visual Perception)
Familiarity with agentic workflows (Planning, Action, Reflection)

Key Terms

MLLM: Multi-Modal Large Language Model, an AI model capable of processing and generating both text and image data

Focus Content: Specific task-relevant information extracted from history screens (e.g., a phone number or match score) needed for subsequent operations

Task Progress: A pure-text summary generated by the Planning Agent describing completed sub-tasks, replacing the raw history of images and actions

Visual Perception Module: A component that converts raw screenshots into structured text and icon coordinates using OCR and icon detection tools

Hallucination: A phenomenon where the model generates incorrect or non-existent information/actions not supported by the input data

UI: User Interface, the visual elements on a screen that a user interacts with