ZeroGUI: Automating Online GUI Learning at Zero Human Cost

📝 Paper Summary

GUI Agents · Online Reinforcement Learning · Synthetic Data Generation

ZeroGUI enables GUI agents to self-improve via online reinforcement learning by using VLMs to automatically generate training tasks and verify success, removing the need for human supervision.

Core Problem

Existing GUI agents rely on expensive offline human annotations and struggle to generalize to dynamic, interactive environments where elements shift or disappear.

Why it matters:

Manual collection of action trajectories and element grounding labels is costly and hard to scale across diverse applications
Agents trained on static offline data often fail in open-ended scenarios due to non-stationary environments
Real-world deployments lack ground-truth labels, preventing agents from learning from their own interactions

Concrete Example: In OSWorld, an agent might be asked to 'Browse the natural products database.' An offline-trained agent might fail if the database UI has changed. ZeroGUI allows the agent to practice on generated variations of this task and receive feedback from a VLM judge, correcting its policy without human intervention.

Key Novelty

Self-Evolving Agent Loop via VLM Simulation

Uses a VLM (Vision-Language Model) to propose diverse tasks from randomly sampled screenshots, giving the agent an open-ended training curriculum without human-written tasks
Replaces hand-crafted evaluation scripts with a VLM-based 'visual judge' that votes on task success based on trajectory screenshots
Adapts the GRPO algorithm for multi-step GUI interactions, enabling the agent to learn from both generated tasks and test-time scenarios

Architecture

The ZeroGUI framework workflow, illustrating the interaction between the VLM components and the GUI agent during online learning.

Evaluation Highlights

+63% relative improvement in success rate for Aguvis-7B on the OSWorld benchmark compared to the base model
+14% relative improvement for UI-TARS-7B-DPO on OSWorld, with significant gains in the feasible task subset (+40%)
Generalizes to mobile environments: +2.8 percentage-point success rate improvement on the AndroidLab operation subset (54.6 to 57.4)

Breakthrough Assessment

8/10

Strong conceptual advance in fully automating the feedback loop for GUI agents. The zero-human-cost framing addresses the primary bottleneck (data) in the field, with significant empirical gains.

⚙️ Technical Details

Problem Definition

Setting: Markov Decision Process (S, A, R, T) where the agent interacts with a GUI to complete instruction I

Inputs: Task instruction I and current state s_t (screenshot o_t, history h_t)

Outputs: Action prediction a_t (e.g., click, type, scroll)

Pipeline Flow

Task Generator (VLM) -> Environment
GUI Agent -> Interaction Trajectory
Reward Estimator (VLM) -> Binary Reward
RL Optimizer -> Updated Policy
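
The four stages above can be read as one round of an online loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Trajectory` container, `run_online_round`, and the stub callables are hypothetical names introduced for illustration, and the VLM calls and the policy update are replaced by placeholders.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical container for one rollout of the agent on one task."""
    task: str
    screenshots: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    reward: float = 0.0


def run_online_round(
    generate_task: Callable[[str], str],
    rollout: Callable[[str], Trajectory],
    estimate_reward: Callable[[Trajectory], int],
    update_policy: Callable[[List[Trajectory]], None],
    initial_screenshot: str,
    group_size: int,
) -> List[Trajectory]:
    """One round: propose a task, roll out a group, score it, update the policy."""
    # Task Generator (VLM) -> Environment: a task is proposed from an initial state.
    task = generate_task(initial_screenshot)
    # GUI Agent -> Interaction Trajectory: several rollouts of the same task.
    group = [rollout(task) for _ in range(group_size)]
    # Reward Estimator (VLM) -> Binary Reward: no ground-truth script is used.
    for traj in group:
        traj.reward = float(estimate_reward(traj))
    # RL Optimizer -> Updated Policy.
    update_policy(group)
    return group


# Stub components (assumptions) so the sketch runs without any model.
def stub_generate_task(screenshot: str) -> str:
    return f"Open the settings panel visible in {screenshot}"


def stub_rollout(task: str) -> Trajectory:
    return Trajectory(task=task, screenshots=["s0.png", "s1.png"], actions=["click(10, 20)"])


def stub_estimate_reward(traj: Trajectory) -> int:
    return random.choice([0, 1])


updates: List[List[Trajectory]] = []
group = run_online_round(
    stub_generate_task, stub_rollout, stub_estimate_reward,
    updates.append, initial_screenshot="desktop.png", group_size=4,
)
```

Passing the components in as callables keeps the loop independent of which VLM or agent is plugged in, which mirrors how the summary lists them as separate modules.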

System Modules

Task Generator

Propose diverse training tasks based on random initial states

Model or implementation: GPT-4o

GUI Agent

Perceive GUI and execute actions

Model or implementation: UI-TARS-7B-DPO or Aguvis-7B

Reward Estimator

Assess task success without ground truth

Model or implementation: Qwen2.5-VL-32B
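
A voting-based reward estimator of the kind described here could be sketched as follows. This is an illustration under assumptions: the `judge` callable stands in for a VLM query, and the number of votes and the agreement threshold (unanimous by default) are hypothetical choices, since this summary does not state the exact voting rule. The judge receives only the instruction and screenshots, in line with the takeaway that agent responses are excluded from reward estimation.

```python
from typing import Callable, Optional, Sequence


def vote_reward(
    judge: Callable[[str, Sequence[str]], bool],
    instruction: str,
    screenshots: Sequence[str],
    num_votes: int = 4,
    min_agree: Optional[int] = None,
) -> int:
    """Return a binary reward from repeated judge queries.

    judge: hypothetical stand-in for one VLM call that answers whether the
        screenshots show the instruction was completed.
    min_agree: number of positive votes required; defaults to all votes
        (an assumption chosen to suppress false positives).
    """
    if min_agree is None:
        min_agree = num_votes
    votes = [bool(judge(instruction, screenshots)) for _ in range(num_votes)]
    return int(sum(votes) >= min_agree)


# Toy judges (assumptions) to exercise the voting rule.
def always_yes(instruction: str, screenshots: Sequence[str]) -> bool:
    return True


class AlternatingJudge:
    """Answers yes, no, yes, no, ... to mimic an unreliable judge."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, instruction: str, screenshots: Sequence[str]) -> bool:
        self.calls += 1
        return self.calls % 2 == 1


shots = ["step0.png", "step1.png"]
reward_confident = vote_reward(always_yes, "Rename the file", shots)
reward_strict = vote_reward(AlternatingJudge(), "Rename the file", shots)
reward_majority = vote_reward(AlternatingJudge(), "Rename the file", shots, min_agree=2)
```

A stricter threshold trades more false negatives for fewer false positives, which is the direction the Main Takeaways section suggests matters most.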

Novel Architectural Elements

Two-stage online RL framework combining generated-task training and test-time adaptation via a VLM-based reward model
Integration of an automatic VLM-based reward estimator that replaces environment-specific verification scripts

Modeling

Base Model: UI-TARS-7B-DPO and Aguvis-7B

Training Method: Online Reinforcement Learning (modified GRPO)
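
The group-relative part of GRPO can be illustrated with a short sketch. It assumes binary trajectory rewards, the population standard deviation, and a small `eps` to avoid division by zero; these details and the function name are assumptions for illustration rather than the paper's exact formulation.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same task.

    Each advantage is (reward - group mean) / (group std + eps), so no
    separate value network is needed to estimate a baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Two successes and two failures: successes get positive advantages.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
# All rollouts succeed: every advantage is zero, so this group gives no signal.
uniform = group_advantages([1.0, 1.0, 1.0])
```

In a multi-step setting, one natural choice is for every step of a trajectory to share that trajectory's advantage; that reading is an assumption here.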

Objective Functions:

Purpose: Optimize policy to maximize expected reward.

Formally: Average of per-step GRPO objectives J_t(θ) using normalized advantages A_hat.

Purpose: Stabilize training by penalizing deviation from reference policy.

Formally: k2-estimator (MSE) KL divergence: D_KL = 0.5 * (log π_θ - log π_ref)^2.
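
To see why the choice of estimator can matter numerically, the sketch below compares the k2 form given above with the k3 form commonly used in GRPO, k3 = exp(d) - d - 1 with d = log π_ref - log π_θ. The function names and the example log-probabilities are assumptions for illustration; the comparison is a plausible reading of the stability claim, not the paper's own analysis.

```python
import math


def kl_k2(logp_theta: float, logp_ref: float) -> float:
    """k2 estimator: half the squared log-probability difference."""
    return 0.5 * (logp_theta - logp_ref) ** 2


def kl_k3(logp_theta: float, logp_ref: float) -> float:
    """k3 estimator: exp(d) - d - 1, where d = logp_ref - logp_theta."""
    d = logp_ref - logp_theta
    return math.exp(d) - d - 1.0


# Policies close together: the two estimators nearly agree.
near_k2 = kl_k2(-1.0, -1.1)
near_k3 = kl_k3(-1.0, -1.1)

# The current policy assigns a much lower probability than the reference:
# k2 grows quadratically in the gap, k3 grows exponentially.
far_k2 = kl_k2(-30.0, -1.0)
far_k3 = kl_k3(-30.0, -1.0)
```

The gradient with respect to log π_θ follows the same pattern: linear in the gap for k2, exponential for k3, which is one way a single token with a large gap could dominate an update.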

Training Data:

Over 4,000 generated Ubuntu tasks (sampled 725 for training)
225 generated Android tasks (sampled 175 for training)

Key Hyperparameters:

learning_rate: 2e-6
optimizer: AdamW
group_size_G: 64
kl_coefficient_beta: 0.1
batch_size: 16384 (sequences per update)
rollout_temperature: 0.5
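
For reference, the listed hyperparameters could be collected in a single configuration object. The class and field names below are hypothetical; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""
    learning_rate: float = 2e-6
    optimizer: str = "AdamW"
    group_size: int = 64            # rollouts per task (G)
    kl_coefficient_beta: float = 0.1
    batch_size: int = 16384         # sequences per update
    rollout_temperature: float = 0.5


config = TrainConfig()
```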

Compute: Reward estimation uses locally deployed Qwen2.5-VL-32B. Training uses 1 epoch per stage.

Comparison to Prior Work

vs. DigiRL: ZeroGUI uses online RL with VLM-based rewards instead of offline RL with rule-based verifiers
vs. Claude Computer-Use: ZeroGUI is a training framework for open models rather than a proprietary endpoint
vs. Offline RFT [not cited in paper]: ZeroGUI uses online exploration and learns from negative samples, whereas Rejection Fine-Tuning only uses positive static data
vs. STaR [not cited in paper]: ZeroGUI applies self-improvement to GUI actions with a visual reward model, rather than just reasoning traces

Limitations

VLM reward estimator can still produce false positives, occasionally leading to overconfidence
Decrease in infeasibility detection performance compared to base models (due to VLM lacking specific software knowledge)
Reliance on large VLMs (GPT-4o, Qwen2.5-VL) for the training pipeline implies high computational/API cost during the learning phase

Reproducibility

Code: https://github.com/OpenGVLab/ZeroGUI

Code is publicly available at https://github.com/OpenGVLab/ZeroGUI. The paper specifies prompt templates for task generation and reward estimation in the Appendix. Base models (UI-TARS, Aguvis) are existing open-weight models. Training relies on GPT-4o for task generation, which is a closed-source dependency.

📊 Experiments & Results

Evaluation Setup

Interactive GUI agents operating on Desktop (Ubuntu) and Mobile (Android)

Benchmarks:

OSWorld (desktop computer tasks on Ubuntu)
AndroidLab (Mobile app operations)

Metrics:

Success Rate (SR)
pass@4 (expected proportion of tasks solved at least once in 4 trials)
all-pass@4 (expected proportion of tasks solved in all 4 trials)
Sub-goal Success Rate (Sub-SR)
Statistical methodology: Mean and standard deviation reported over 4 runs
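
The metrics above can be computed from a per-task record of trial outcomes. The sketch below assumes exactly 4 trials per task, in which case pass@4 and all-pass@4 reduce to simple empirical fractions; the function names and the toy results are assumptions for illustration.

```python
import statistics
from typing import List, Sequence


def pass_at_all_trials(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks solved in at least one trial (pass@k with k trials)."""
    return sum(any(task) for task in results) / len(results)


def all_pass_at_all_trials(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks solved in every trial (all-pass@k with k trials)."""
    return sum(all(task) for task in results) / len(results)


def per_run_success_rates(results: Sequence[Sequence[bool]]) -> List[float]:
    """Success rate of each run, where run j is the j-th trial of every task."""
    return [sum(run) / len(run) for run in zip(*results)]


# Toy data: 3 tasks, 4 trials each (rows are tasks, columns are runs).
toy_results = [
    [True, False, False, False],   # solved once
    [True, True, True, True],      # solved every time
    [False, False, False, False],  # never solved
]

pass4 = pass_at_all_trials(toy_results)
all_pass4 = all_pass_at_all_trials(toy_results)
run_rates = per_run_success_rates(toy_results)
mean_sr = statistics.fmean(run_rates)
std_sr = statistics.pstdev(run_rates)
```

The gap between pass@4 and all-pass@4 separates tasks the agent can solve sometimes from tasks it solves consistently.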

Key Results

Results on OSWorld demonstrate that ZeroGUI significantly improves both Aguvis and UI-TARS agents, particularly on feasible tasks. AndroidLab results confirm generalization to mobile environments.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
OSWorld (Full Test Set)	Success Rate	3.0	4.9	+1.9
OSWorld (Full Test Set)	Success Rate	17.7	20.2	+2.5
OSWorld (Feasible Subset)	Success Rate	11.3	15.8	+4.5
AndroidLab (Operation Subset)	Success Rate	54.6	57.4	+2.8
OSWorld (Daily Domain)	Success Rate	24.5	27.2	+2.7
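
The relative improvements quoted in the Evaluation Highlights follow from the absolute numbers in the table. As a quick check, the sketch below recomputes them; the helper name and the row labels are assumptions for illustration.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# (baseline, this paper) success rates copied from the table rows.
rows = {
    "OSWorld full test set, first row": (3.0, 4.9),
    "OSWorld full test set, second row": (17.7, 20.2),
    "OSWorld feasible subset": (11.3, 15.8),
}

gains = {name: round(relative_gain_percent(b, p)) for name, (b, p) in rows.items()}
```

The rounded values come out to 63, 14, and 40, which match the +63% (Aguvis-7B), +14% (UI-TARS-7B-DPO), and +40% (feasible subset) figures reported earlier.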

Experiment Figures

Comparison of training stability between standard GRPO (k3-KL) and the proposed modification (k2-KL).

Main Takeaways

Two-stage training is complementary: training on generated tasks expands capability coverage (improving pass@4), while test-time training improves consistency (improving all-pass@4)
False positive rewards are more detrimental than false negatives; the voting mechanism and exclusion of agent responses in reward estimation are critical for filtering these out
Replacing the standard GRPO k3-estimator with a k2-estimator (MSE loss) stabilizes training and prevents gradient overflow

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals
Vision-Language Models (VLMs) for GUI understanding
Policy Optimization algorithms

Key Terms

GUI: Graphical User Interface—visual interface enabling user interaction via icons and menus

VLM: Vision-Language Model—AI models capable of processing both images (screenshots) and text

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of samples to estimate advantages without a separate value network

SFT: Supervised Fine-Tuning—training models on labeled datasets before RL adaptation

ADB: Android Debug Bridge—command-line tool for communicating with Android devices

pass@k: A metric estimating the probability of solving a task at least once in k attempts

DPO: Direct Preference Optimization—an algorithm for aligning language models using preference pairs

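k2-estimator: A sample-based estimator of KL divergence equal to half the squared difference between the two policies' log-probabilities; it is biased but low-variance, and is implemented here as a Mean Squared Error (MSE) penalty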

hallucination: When a model generates incorrect or non-existent information, a key challenge in VLM-based reward estimation

test-time training: Updating the model parameters during the evaluation phase using rewards estimated from the test inputs themselves