MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents

📝 Paper Summary

Mobile GUI Agents · Online Reinforcement Learning · Vision Language Models

MobileRL improves mobile GUI agents by scaling online reinforcement learning with difficulty-adaptive strategies that prioritize solvable tasks and reward shorter successful trajectories.

Core Problem

Training mobile GUI agents via online RL is inefficient due to sparse rewards, heavy-tailed task difficulty (many tasks are unsolvable or trivial), and the high latency of mobile emulators.

Why it matters:

Supervised fine-tuning on static data limits behavior coverage and error recovery capabilities
Naive sampling in expensive mobile simulators wastes computational budget on persistently unsolvable tasks or redundant successes
Base vision-language models struggle to produce correct action commands for complex GUI instructions without dense feedback

Concrete Example: In a task like 'add an event for tomorrow at 3pm', a standard RL agent may fail repeatedly and receive no reward until the horizon is reached, or succeed by chance via a very long, inefficient path. Rewarding that success at full value reinforces verbose behavior.

Key Novelty

Difficulty-Adaptive Group Relative Policy Optimization (AdaGRPO)

Optimizes policy using a group-relative baseline that adapts to task difficulty by filtering out persistently failing tasks (Failure Curriculum Filtering)
Reshapes sparse binary success rewards to favor shorter, more efficient trajectories (Shortest-Path Reward Adjustment)
Maintains a buffer of rare, challenging successful trajectories to replay positive signals alongside on-policy data (Difficulty-Adaptive Positive Replay)
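As a rough illustration of the replay idea in the last bullet, the sketch below keeps a bounded buffer of successful trajectories from tasks with a low observed success rate and mixes a few of them into an on-policy batch. The class name `PositiveReplayBuffer`, the `max_success_rate` threshold, and the `replay_fraction` mixing ratio are assumptions made for illustration, not details taken from the paper.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    task_id: str
    steps: int
    reward: float  # 1.0 for success, 0.0 for failure (binary outcome reward)


class PositiveReplayBuffer:
    """Bounded store of successful trajectories from hard tasks (hypothetical)."""

    def __init__(self, capacity=256, max_success_rate=0.25):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted first
        self.max_success_rate = max_success_rate

    def add_group(self, group):
        """Keep successes only when the group's success rate marks the task as hard."""
        if not group:
            return
        success_rate = sum(t.reward > 0 for t in group) / len(group)
        if 0 < success_rate <= self.max_success_rate:
            self.buffer.extend(t for t in group if t.reward > 0)

    def mix(self, on_policy, replay_fraction=0.25, rng=random):
        """Return the on-policy batch plus a small sample of stored successes."""
        k = min(len(self.buffer), int(len(on_policy) * replay_fraction))
        return list(on_policy) + rng.sample(list(self.buffer), k)


buf = PositiveReplayBuffer(capacity=4, max_success_rate=0.25)
hard = [Trajectory("t1", 10, 1.0)] + [Trajectory("t1", 30, 0.0) for _ in range(7)]
easy = [Trajectory("t2", 5, 1.0) for _ in range(8)]
buf.add_group(hard)   # 1 of 8 succeeded: the success is stored
buf.add_group(easy)   # 8 of 8 succeeded: nothing is stored
batch = buf.mix(easy)
```

A real implementation would also need to handle the off-policy nature of replayed samples (for example through the importance ratio); that part is left out here.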

Architecture

The MobileRL framework pipeline, illustrating the three stages: Reasoning-free SFT, Reasoning SFT, and Agentic RL with AdaGRPO.

Evaluation Highlights

MobileRL-9B achieves 80.2% success rate on AndroidWorld, surpassing the previous state-of-the-art (64.2%)
MobileRL-9B achieves 53.6% success rate on AndroidLab, outperforming the previous best (41.2%)
MobileRL-7B (+16% on AndroidWorld) outperforms significantly larger 72B-parameter models like UI-Tars-1.5 and UI-Genie-Agent

Breakthrough Assessment

9/10

Significant jump in SOTA performance on major benchmarks (AndroidWorld/AndroidLab) with a scalable framework that addresses the core bottleneck of efficiency in agentic RL.

⚙️ Technical Details

Problem Definition

Setting: Finite-horizon Markov Decision Process (MDP) M=(S, A, P, r, H, μ0)

Inputs: Natural language instruction c and initial state s_0

Outputs: Sequence of atomic GUI actions (Tap, Swipe, Type, etc.) ending in Finish or horizon H
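To make the setting concrete, here is a minimal sketch of one finite-horizon episode: the policy emits atomic actions until it outputs Finish or the horizon H is reached, and a sparse binary reward is assigned at the end. `ToyEnv`, `make_scripted_policy`, and the success rule are stand-ins invented for this example; a real setup would drive an Android emulator and query a vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str          # e.g. "Tap", "Swipe", "Type", "Finish"
    argument: str = ""


class ToyEnv:
    """Stand-in for an Android emulator; the task succeeds once "3pm" is typed."""

    def reset(self, instruction: str) -> str:
        self.typed = False
        return "screen_0"  # placeholder for a screenshot observation

    def step(self, action: Action) -> str:
        if action.kind == "Type" and action.argument == "3pm":
            self.typed = True
        return "screen_next"

    def verify(self) -> float:
        return 1.0 if self.typed else 0.0  # sparse binary reward at episode end


def rollout(env, policy: Callable[[str, str], Action], instruction: str,
            horizon: int) -> Tuple[List[Action], float]:
    """Run one episode: stop on Finish or after `horizon` steps."""
    state = env.reset(instruction)
    actions = []
    for _ in range(horizon):
        action = policy(instruction, state)
        actions.append(action)
        if action.kind == "Finish":
            break
        state = env.step(action)
    return actions, env.verify()


def make_scripted_policy():
    """Replay a fixed action script in place of a learned policy."""
    script = iter([Action("Tap", "calendar"), Action("Type", "3pm"), Action("Finish")])
    return lambda instruction, state: next(script)


actions, reward = rollout(ToyEnv(), make_scripted_policy(),
                          "add an event for tomorrow at 3pm", horizon=10)
print(len(actions), reward)  # 3 1.0
```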

Pipeline Flow

Input Instruction & Screen → Reasoning-free SFT → Reasoning SFT (Warm-up) → Agentic RL (AdaGRPO) → Action Execution
Group: RL Optimization Loop

System Modules

Reasoning SFT Warm-up

Initializes the policy with reasoning capabilities using bootstrapped Chain-of-Thought data from expert demonstrations

Model or implementation: Qwen2.5-VL-7B-Instruct or GLM-4.1V-9B-Base

AdaGRPO Optimization

Optimizes the policy using adaptive sampling and reward shaping

Model or implementation: Same as Warm-up model (weights updated)

Distributed Environment Manager

Orchestrates hundreds of Dockerized Android AVDs for concurrent sampling

Model or implementation: Docker / Android Emulator
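The concurrent-sampling idea can be sketched with a thread pool that spreads one task's rollouts across a pool of devices. `MockAVD` and `sample_group` are hypothetical names, and the mock episode only draws random numbers; the actual manager's interface is not described in this summary.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class MockAVD:
    """Placeholder for one Dockerized Android virtual device (hypothetical)."""

    def __init__(self, device_id):
        self.device_id = device_id

    def run_episode(self, task, seed):
        rng = random.Random(seed)  # deterministic stand-in for a real rollout
        steps = rng.randint(3, 20)
        success = rng.random() < 0.5
        return {"task": task, "device": self.device_id, "steps": steps, "success": success}


def sample_group(devices, task, group_size):
    """Collect `group_size` rollouts of one task, spread round-robin over devices.

    This toy version may hand the same device to two concurrent jobs; a real
    manager would lease each emulator exclusively and reset it between episodes.
    """
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = [
            pool.submit(devices[i % len(devices)].run_episode, task, seed=i)
            for i in range(group_size)
        ]
        return [f.result() for f in futures]


devices = [MockAVD(i) for i in range(4)]
group = sample_group(devices, "add an event for tomorrow at 3pm", group_size=8)
print(len(group), sum(r["success"] for r in group))
```

Threads fit this workload because each worker mostly waits on emulator I/O rather than on Python computation.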

Novel Architectural Elements

Integration of Failure Curriculum Filtering (FCF) directly into the GRPO sampling loop to prune dead-end tasks
Shortest-Path Reward Adjustment (SPA) mechanism modifying the standard GRPO advantage calculation

Modeling

Base Model: Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base

Training Method: Difficulty-Adaptive Group Relative Policy Optimization (AdaGRPO)

Objective Functions:

Purpose: Optimize policy to maximize expected reward relative to group average, constrained by KL divergence.

Formally: L_GRPO(θ) = E [ (π_θ(a|s)/π_old(a|s)) * A_hat ] - β * D_KL(π_θ || π_ref)
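A minimal sketch of how the group-relative advantage A_hat and the objective above could be computed for one rollout group. Dividing by the group standard deviation is the common GRPO formulation and is assumed here; this summary does not state whether MobileRL normalizes this way. The function names and the scalar `kl_to_ref` input are illustrative.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """A_hat_i = (R_i - mean(R)) / (std(R) + eps) over one task's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def surrogate_objective(logp_new, logp_old, advantages, kl_to_ref, beta):
    """Mean of ratio * A_hat minus beta * KL, mirroring the formula above.

    Clipping of the ratio, used in many GRPO implementations, is omitted here.
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    policy_term = mean(r * a for r, a in zip(ratios, advantages))
    return policy_term - beta * kl_to_ref


rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
print([round(a, 3) for a in adv])  # [1.0, -1.0, -1.0, 1.0]
```

Under this normalization a group with identical rewards (for example, all failures) yields zero advantages, so it contributes no policy-gradient signal.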

Purpose: Reshape sparse binary rewards to penalize length.

Formally: R(τ) = r(τ) * (T_min / T_i)^α
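The reward adjustment can be worked through on a small group. The sketch below assumes T_min is the length of the shortest successful trajectory in the same group and that failed rollouts keep a reward of 0; both choices are assumptions, since the summary gives only the formula.

```python
def shortest_path_adjusted_rewards(rewards, lengths, alpha=1.0):
    """Apply R_i = r_i * (T_min / T_i) ** alpha within one rollout group.

    T_min is taken here as the shortest successful trajectory in the group,
    which is an assumption; failed rollouts keep a reward of 0.
    """
    successful = [t for r, t in zip(rewards, lengths) if r > 0]
    if not successful:
        return [0.0 for _ in rewards]
    t_min = min(successful)
    return [r * (t_min / t) ** alpha if r > 0 else 0.0 for r, t in zip(rewards, lengths)]


rewards = [1.0, 1.0, 0.0, 1.0]
lengths = [8, 16, 30, 32]
print(shortest_path_adjusted_rewards(rewards, lengths, alpha=1.0))  # [1.0, 0.5, 0.0, 0.25]
print(shortest_path_adjusted_rewards(rewards, lengths, alpha=0.5))
```

A smaller α softens the penalty: with α = 0.5 the 32-step success keeps half of its reward instead of a quarter.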

Purpose: Down-weight sampling probability for persistently failing tasks.

Formally: w_task = exp(-f) after cooldown
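One possible reading of the weighting rule, sketched under stated assumptions: f is taken to be the number of consecutive epochs in which every rollout of a task failed, and a hypothetical `patience` threshold stands in for the cooldown condition, which this summary does not spell out.

```python
import math
import random


def fcf_weights(failure_streaks, patience=2):
    """Map each task's consecutive all-fail count f to a sampling weight.

    Tasks with f < patience keep weight 1.0; others are down-weighted to exp(-f).
    The `patience` threshold is a hypothetical stand-in for the cooldown rule.
    """
    return {
        task: 1.0 if f < patience else math.exp(-f)
        for task, f in failure_streaks.items()
    }


def sample_tasks(failure_streaks, k, rng, patience=2):
    """Draw k tasks (with replacement) in proportion to their FCF weights."""
    weights = fcf_weights(failure_streaks, patience)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=k)


streaks = {"set_alarm": 0, "add_event": 1, "export_contacts": 4}
print(fcf_weights(streaks))
batch = sample_tasks(streaks, k=1000, rng=random.Random(0))
print({t: batch.count(t) for t in streaks})
```

With f = 4 the weight is exp(-4) ≈ 0.018, so the persistently failing task is still sampled occasionally rather than removed outright.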

Adaptation: Full fine-tuning

Training Data:

Expert demonstrations from AndroidWorld and AndroidControl for SFT
Bootstrapped reasoning data generated by Instruct model for Reasoning SFT

Key Hyperparameters:

kl_beta: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
reward_adjustment_alpha: Parameter α in (0, 1] (exact value not specified in text)

Compute: Hundreds of Dockerized Android virtual devices (AVDs) across multiple machines; >1,000 concurrent environments supported

Comparison to Prior Work

vs. UI-Tars: MobileRL-7B outperforms UI-Tars-72B despite being 10x smaller, due to online agentic RL vs. SFT/offline approaches
vs. DigiRL: MobileRL uses online on-policy learning with adaptive sampling (AdaGRPO) rather than offline RL
vs. AppAgent: MobileRL employs large-scale parallel training with verifiable rewards rather than just prompt engineering or limited exploration
vs. GLM-4.1V-9B-Base: MobileRL adds reasoning SFT and AdaGRPO on top of the base model to raise success rates

Limitations

Relies on verifiable rewards, which may not exist for all open-ended real-world tasks
High computational cost for environment simulation (requires managing hundreds of AVDs)
Training efficiency depends heavily on the quality of the reasoning SFT warm-up

Reproducibility

Code: https://github.com/THUDM/MobileRL

The code and framework are open-sourced at https://github.com/THUDM/MobileRL. The system is built on the Verl framework and uses Dockerized Android emulators. Hyperparameters such as learning rate and batch size are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Interactive Android environment execution

Benchmarks:

AndroidWorld (General-purpose mobile interaction tasks)
AndroidLab (Mobile app interaction tasks)

Metrics:

Success Rate (SR)
Statistical methodology: Not explicitly reported in the paper

Key Results

MobileRL achieves state-of-the-art performance on AndroidWorld, surpassing much larger models.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
AndroidWorld	Success Rate	64.2	80.2	+16.0
AndroidWorld	Success Rate	47.7	73.5	+25.8

MobileRL demonstrates superior performance on AndroidLab compared to baselines.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
AndroidLab	Success Rate	41.2	53.6	+12.4
AndroidLab	Success Rate	29.2	52.0	+22.8
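The Δ column is an absolute difference in percentage points. The short check below recomputes it from the table values and also derives the relative improvement, which is an extra quantity added here for illustration and does not appear in the paper summary.

```python
rows = [
    ("AndroidWorld", 64.2, 80.2),
    ("AndroidWorld", 47.7, 73.5),
    ("AndroidLab", 41.2, 53.6),
    ("AndroidLab", 29.2, 52.0),
]

deltas = []
for benchmark, baseline, ours in rows:
    delta = round(ours - baseline, 1)                         # percentage points
    relative = round(100 * (ours - baseline) / baseline, 1)   # percent of baseline
    deltas.append(delta)
    print(f"{benchmark}: +{delta} points ({relative}% relative)")
```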

Main Takeaways

Online agentic RL with difficulty adaptation (AdaGRPO) significantly outperforms SFT and zero-shot baselines, even those with 10x more parameters.
The two-stage warm-up (Reasoning-free SFT + Reasoning SFT) is crucial for efficient RL exploration.
Scalable distributed sampling (hundreds of AVDs) enables effective training on heavy-tailed distributions where successes are initially rare.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO)
Vision Language Models (VLMs)
Markov Decision Processes (MDP)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a trajectory's reward to the average reward of a group of trajectories for the same input

SFT: Supervised Fine-Tuning—training a model on labeled examples to provide a warm start before RL

AdaGRPO: Difficulty-Adaptive Group Relative Policy Optimization—the proposed algorithm extending GRPO with difficulty-aware sampling and reward shaping

AdaPR: Difficulty-Adaptive Positive Replay—a mechanism to store and reuse rare successful trajectories from hard tasks to improve sample efficiency

FCF: Failure Curriculum Filtering—a strategy to temporarily stop sampling tasks that have consistently failed, saving compute

SPA: Shortest-Path Reward Adjustment—modifying the binary success reward to favor shorter trajectories, penalizing inefficient paths

VLM: Vision Language Model—a multimodal model capable of processing both text and images (screenshots)

AVD: Android Virtual Device—a software emulator for the Android operating system

KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
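For intuition, the KL divergence between two discrete distributions can be computed directly from its definition. The probabilities below are made-up values standing in for a policy π_θ and a reference π_ref over three actions.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Terms with p_i == 0 contribute 0; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


policy = [0.7, 0.2, 0.1]      # hypothetical action probabilities under π_θ
reference = [0.5, 0.3, 0.2]   # hypothetical probabilities under π_ref
print(round(kl_divergence(policy, reference), 4))
print(kl_divergence(reference, reference))  # 0.0
```

The measure is asymmetric: swapping the two arguments generally gives a different value, which is why the direction D_KL(π_θ || π_ref) is stated explicitly in the objective.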