
MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment

Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, Dong Yu
arXiv.org (2025)
Agent RL MM Benchmark

📝 Paper Summary

Mobile GUI Agents Online Reinforcement Learning
MobileGUI-RL trains vision-based agents in online environments using a scalable batched infrastructure, a synthetic curriculum filtered for feasibility, and a trajectory-aware variant of Group Relative Policy Optimization (GRPO).
Core Problem
Training GUI agents via offline supervised learning leads to overfitting on static templates and poor generalization, while online RL struggles with slow interaction speeds and sparse rewards in long-horizon tasks.
Why it matters:
  • Offline methods rely on labor-intensive, high-quality annotations that are hard to scale
  • Static policies fail in real-world apps where UI elements change dynamically or disappear unpredictably
  • Standard RL fails to converge efficiently because many action sequences yield no reward (sparse signal) or fail due to a single misstep
Concrete Example: A real-world GUI might introduce a new screen or pop-up not seen in the training data. An agent trained offline on static traces will fail to adapt. An online agent can explore, but a standard binary reward (0 or 1) does not distinguish a fast, efficient success from a clumsy, slow one.
Key Novelty
MobileGUI-RL (Online RL Framework for GUI)
  • Generates a curriculum of tasks by having an exploration agent perform random walks, using GPT-4o to reverse-engineer instructions from those walks, and filtering them with a world model to ensure they are solvable
  • Adapts GRPO (Group Relative Policy Optimization) into 'MobGRPO' by assigning advantage scores to the entire trajectory rather than individual steps, enabling learning from long-horizon sparse rewards
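The trajectory-level advantage idea can be illustrated with a minimal Python sketch. It assumes the usual GRPO normalization (each rollout's reward minus the group mean, divided by the group standard deviation) and then copies that single scalar to every step of the trajectory. The function names, the `eps` constant, and the example rewards are hypothetical; this summary does not give the exact MobGRPO formula, so treat this as an illustration rather than the authors' implementation.

```python
from statistics import mean, pstdev


def trajectory_advantages(rewards, eps=1e-8):
    """One advantage per trajectory, normalized within the rollout group.

    `rewards` holds one scalar reward per trajectory sampled for the same
    task. The mean/std normalization is the standard GRPO form (assumed here).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_steps(advantages, step_counts):
    """Give every step of a trajectory the same trajectory-level advantage."""
    return [[a] * n for a, n in zip(advantages, step_counts)]


# Hypothetical group of four rollouts for one task: two succeed, two fail.
group_rewards = [1.0, 0.0, 0.0, 1.0]
group_steps = [3, 5, 4, 2]

advs = trajectory_advantages(group_rewards)
per_step = broadcast_to_steps(advs, group_steps)
```

Because the advantage is computed once per trajectory, a sparse end-of-episode reward still gives a learning signal to every action in a long rollout. If all rollouts in a group receive the same reward, every advantage is zero and that group contributes no gradient.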
Breakthrough Assessment
8/10
Addresses the critical bottleneck of static data in GUI agents by making online RL feasible through batched execution and curriculum learning. The trajectory-level advantage formulation is a smart adaptation for sparse-reward tasks.