DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

📝 Paper Summary

Android device control
Multi-modal agents

DigiRL trains vision-language models to control Android devices by fine-tuning them on their own autonomous interactions using a scalable parallel environment and a specialized offline-to-online reinforcement learning algorithm.

Core Problem

Off-the-shelf vision-language models struggle with real-world device control because static training data fails to capture the stochasticity, non-stationarity, and dynamic nature of in-the-wild GUIs.

Why it matters:

Static demonstrations become stale as apps update, leading to agents that cannot adapt to visual changes or recover from their own mistakes.
Proprietary model wrappers (like GPT-4V with prompting) are slow, expensive, and limited by the base model's inability to reason about low-level pixel actions.
Existing RL methods for device control are often too sample-inefficient or require simplified, deterministic environments that don't reflect real-world complexity.

Concrete Example: When asked to 'Go to newegg.com and search for razer kraken', a model trained on static data might fail if a new pop-up ad appears or the search bar moves, and it cannot recover because it never encountered such failures during training.

Key Novelty

Scalable Offline-to-Online RL for Device Control

Builds a parallelized Android environment that runs up to 64 emulators simultaneously, paired with a VLM-based autonomous evaluator that provides reward signals without human intervention.
Uses a curriculum-based Advantage-Weighted Regression (AWR) approach that filters training data based on task difficulty (instruction-level value) and action quality (step-level value) to handle noisy, stochastic environments.
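
The instruction-level part of the curriculum can be illustrated with a small sketch. It assumes that each collected trajectory is scored by the gap between its observed reward and the learned instruction-level value, and that only the top-scoring fraction is kept for training. The function name `select_curriculum`, the `keep_fraction` parameter, and the dictionary layout are hypothetical and chosen only for illustration; the paper's exact selection rule may differ.

```python
from typing import Dict, List


def select_curriculum(
    trajectories: List[dict],
    instruction_value: Dict[str, float],
    keep_fraction: float = 0.5,
) -> List[dict]:
    """Keep the trajectories whose reward most exceeds the predicted task success rate.

    Assumption: score = trajectory reward - instruction-level value estimate.
    A success on a task the agent usually fails scores high; a success on an
    already-mastered task scores close to zero.
    """
    scored = [
        (traj["reward"] - instruction_value[traj["instruction"]], traj)
        for traj in trajectories
    ]
    # Sort by score, highest first (Python's sort is stable for ties).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return [traj for _, traj in scored[:keep]]


# Toy data: four tasks with binary rewards and learned success estimates.
toy_trajectories = [
    {"instruction": "A", "reward": 1},
    {"instruction": "B", "reward": 1},
    {"instruction": "C", "reward": 0},
    {"instruction": "D", "reward": 0},
]
toy_values = {"A": 0.9, "B": 0.2, "C": 0.5, "D": 0.1}
selected = select_curriculum(toy_trajectories, toy_values, keep_fraction=0.5)
selected_names = [traj["instruction"] for traj in selected]
```

In this toy run the success on task B (usually failed) ranks above the success on task A (usually solved), which is the intended effect of filtering by task difficulty.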

Architecture

Overview of the DigiRL training pipeline, showing the two-stage process (Offline RL -> Online RL) and the parallelized environment.

Evaluation Highlights

Achieves a 67.2% success rate on Android-in-the-Wild (AitW) tasks, an absolute improvement of 49.5 percentage points over supervised fine-tuning (17.7%).
Outperforms the 18B-parameter CogAgent (38.5%) and the GPT-4V-based AppAgent (8.3%) despite using a much smaller 1.3B-parameter model.
Surpasses the best prior autonomous method (Filtered Behavior Cloning) by 9.4 percentage points (67.2% vs. 57.8%).

Breakthrough Assessment

9/10

Establishes a new state of the art for device control by successfully scaling autonomous offline-to-online RL, demonstrating that small models trained on their own experience can significantly outperform much larger models, including proprietary ones.

⚙️ Technical Details

Problem Definition

Setting: Finite-horizon Markov Decision Process (MDP) for pixel-based device control

Inputs: Natural language instruction c, current screen screenshot s_t

Outputs: Action a_t (tap/slide coordinates, type text, or system buttons)

Pipeline Flow

Observation Processing (Screenshot → VLM)
Action Generation (VLM → Action Token)
Execution (Action Token → Android Emulator)
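
The three pipeline stages above amount to an observe, act, execute loop. The following is a minimal sketch under stated assumptions: `ToyEmulator`, `toy_policy`, and the `Action` dataclass are hypothetical stand-ins (screenshots are plain strings here), not the paper's implementation or any real emulator API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str          # "tap", "slide", "type", or "button"
    payload: str = ""  # coordinates or text, kept as a string for simplicity


class ToyEmulator:
    """Hypothetical stand-in for an Android emulator."""

    def __init__(self, horizon: int):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> str:
        self.t = 0
        return "screen_0"

    def step(self, action: Action) -> Tuple[str, bool]:
        # The episode ends at the horizon or when a system button is pressed.
        self.t += 1
        done = self.t >= self.horizon or action.kind == "button"
        return f"screen_{self.t}", done


def rollout(
    env: ToyEmulator,
    policy: Callable[[str, str], Action],
    instruction: str,
) -> List[Tuple[str, Action]]:
    """Collect (screenshot, action) pairs for one episode."""
    screenshot = env.reset()
    trajectory = []
    done = False
    while not done:
        action = policy(instruction, screenshot)       # screenshot -> action
        next_screenshot, done = env.step(action)       # action -> emulator
        trajectory.append((screenshot, action))
        screenshot = next_screenshot
    return trajectory


def toy_policy(instruction: str, screenshot: str) -> Action:
    # Taps until it sees "screen_2", then presses a system button.
    if screenshot == "screen_2":
        return Action("button", "home")
    return Action("tap", "0.5,0.5")


traj = rollout(ToyEmulator(horizon=10), toy_policy, "Go to newegg.com")
```

In a real setup the policy call would run the VLM on pixels and the instruction, and the emulator step would dispatch the decoded action to the device.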

System Modules

Vision Encoder

Encodes the raw screenshot into visual features

Model or implementation: PaliGemma 3B variant (specifically 1.3B parameters used in experiments)

Policy Network

Generates the action description based on instruction and visual features

Model or implementation: PaliGemma 1.3B (Transformer Decoder)

Android Emulator

Executes the action and updates the state

Model or implementation: Android Emulator (avd)

Novel Architectural Elements

Parallelized Android Learning Environment allowing real-time collection of experience from 64 concurrent emulators
Integration of a VLM-based autonomous evaluator (Gemini 1.5 Pro) into the training loop for scalable reward generation
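
A rough sketch of how parallel collection and autonomous reward labeling could fit together is given below. It uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the worker `run_episode` and the scorer `toy_evaluator` are hypothetical placeholders for an emulator-driving worker and a VLM-based evaluator, and the success rule inside `toy_evaluator` is arbitrary.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_episode(worker_id: int, instruction: str) -> dict:
    """Hypothetical worker; a real one would drive one emulator instance."""
    rng = random.Random(worker_id)  # seeded so the toy example is repeatable
    return {
        "worker": worker_id,
        "instruction": instruction,
        "steps": rng.randint(1, 5),
    }


def toy_evaluator(episode: dict) -> int:
    """Placeholder for the autonomous evaluator: returns a binary reward."""
    return 1 if episode["steps"] <= 3 else 0


def collect(instructions: List[str], num_workers: int = 64) -> List[dict]:
    """Run one episode per instruction across a pool of workers, then label rewards."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves input order, so episode i belongs to instruction i.
        episodes = list(pool.map(run_episode, range(len(instructions)), instructions))
    for episode in episodes:
        episode["reward"] = toy_evaluator(episode)
    return episodes


batch = collect(["open settings"] * 8, num_workers=4)
```

Threads are used here only because the toy workers are trivial; the choice between threads, processes, or separate machines for real emulators is a deployment decision this sketch does not address.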

Modeling

Base Model: PaliGemma (1.3B parameters)

Training Method: DigiRL (Advantage-Weighted Regression with Auto-Curriculum)

Objective Functions:

Purpose: Update the policy to maximize advantage-weighted log-likelihood.

Formally: L_actor = - sum [ I(A_step > 1/H) * log pi(a|s,c) ], where I(.) is the indicator function, A_step is the step-level advantage, H is the horizon, and pi(a|s,c) is the policy's probability of action a given screenshot s and instruction c

Purpose: Train step-level value function to predict success probability from state.

Formally: Cross-entropy loss on Monte-Carlo trajectory rewards

Purpose: Train instruction-level value function to predict task difficulty for curriculum.

Formally: Cross-entropy loss on trajectory rewards
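
The objectives listed above can be written out in a few lines of plain Python. This is a sketch under assumptions: log-probabilities, advantages, and value predictions are passed in as lists of floats rather than produced by a model, and the names `actor_loss` and `value_ce_loss` are illustrative only.

```python
import math
from typing import List


def actor_loss(log_probs: List[float], step_advantages: List[float], horizon: int) -> float:
    """L_actor = -sum of I(A_step > 1/H) * log pi(a|s,c).

    Steps whose advantage does not exceed 1/H contribute nothing, so the
    weighting is a hard 0/1 filter.
    """
    threshold = 1.0 / horizon
    return -sum(lp for lp, adv in zip(log_probs, step_advantages) if adv > threshold)


def value_ce_loss(pred_success_probs: List[float], rewards: List[int], eps: float = 1e-8) -> float:
    """Mean binary cross-entropy between predicted success probability and the 0/1 reward.

    The same form serves both value functions; only the input differs
    (state and instruction for step-level, instruction alone for instruction-level).
    """
    total = 0.0
    for p, r in zip(pred_success_probs, rewards):
        total -= r * math.log(p + eps) + (1 - r) * math.log(1 - p + eps)
    return total / len(rewards)


# With H = 10 the threshold is 0.1, so the middle step (advantage 0.05) is dropped.
example_actor = actor_loss([-0.1, -2.0, -0.5], [0.5, 0.05, 0.2], horizon=10)
example_value = value_ce_loss([0.5], [1])
```

Dropping low-advantage steps means the policy is only pushed toward actions that improved the estimated chance of success by more than 1/H.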

Adaptation: Full fine-tuning

Trainable Parameters: 1.3B

Training Data:

Offline phase: AitW training set
Online phase: Autonomous rollouts on Android emulators using AitW instructions

Key Hyperparameters:

doubly_robust_lambda: Not explicitly reported in the paper
advantage_threshold: 1/H (where H is horizon)

Compute: 64 parallel Android emulators for data collection

Comparison to Prior Work

vs. AppAgent: Trains weights via RL instead of using frozen API model with prompting; significantly faster and more accurate
vs. CogAgent: Achieves higher performance with much smaller model (1.3B vs 18B) via active online interaction
vs. Filtered BC: Uses advantage-weighted RL with curriculum rather than just cloning successful trajectories; handles stochasticity better
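
The contrast with Filtered BC can be made concrete with a toy comparison of which steps each method would train on. The data layout and both function names are hypothetical; the step-level rule reuses the 1/H advantage threshold listed under Key Hyperparameters.

```python
from typing import List


def filtered_bc_steps(trajectories: List[dict]) -> List[dict]:
    """Filtered BC: keep every step of any trajectory that ended in success."""
    return [
        step
        for traj in trajectories
        if traj["reward"] == 1
        for step in traj["steps"]
    ]


def advantage_filtered_steps(trajectories: List[dict], horizon: int) -> List[dict]:
    """Step-level filtering: keep a step only if its advantage exceeds 1/H."""
    threshold = 1.0 / horizon
    return [
        step
        for traj in trajectories
        for step in traj["steps"]
        if step["advantage"] > threshold
    ]


toy = [
    {   # Successful episode that contains one wasted step ("b").
        "reward": 1,
        "steps": [{"name": "a", "advantage": 0.4}, {"name": "b", "advantage": -0.2}],
    },
    {   # Failed episode that still contains one useful step ("c").
        "reward": 0,
        "steps": [{"name": "c", "advantage": 0.3}, {"name": "d", "advantage": -0.5}],
    },
]
bc_names = [s["name"] for s in filtered_bc_steps(toy)]
adv_names = [s["name"] for s in advantage_filtered_steps(toy, horizon=10)]
```

Trajectory-level filtering clones the wasted step "b" and discards the useful step "c", whereas step-level filtering does the opposite. How much this matters in practice depends on the quality of the advantage estimates.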

Limitations

Relies on a proprietary VLM (Gemini 1.5 Pro) for reward evaluation, which may have costs and access limits.
Requires significant infrastructure (64 parallel emulators) for effective online training.
Experiments limited to Android; transfer to other OS (e.g., Desktop, iOS) not tested.

Reproducibility

The paper does not state whether code is available. The method relies on a proprietary evaluator (Gemini 1.5 Pro) and a complex infrastructure of 64 parallel Android emulators, which may be difficult to replicate without significant compute resources.

📊 Experiments & Results

Evaluation Setup

Evaluation on real Android emulators using instructions from the Android-in-the-Wild (AitW) dataset.

Benchmarks:

Android-in-the-Wild (AitW) (Device Control / GUI Navigation)

Metrics:

Success Rate (SR)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison on the Android-in-the-Wild (AitW) dataset, showing DigiRL significantly outperforming each baseline (success rates in %, Δ in percentage points).
Benchmark	Metric	Baseline	This Paper	Δ
AitW (General + Web Shopping)	Success Rate	17.7 (SFT)	67.2	+49.5
AitW (General + Web Shopping)	Success Rate	38.5 (CogAgent)	67.2	+28.7
AitW (General + Web Shopping)	Success Rate	8.3 (AppAgent, GPT-4V)	67.2	+58.9
AitW (General + Web Shopping)	Success Rate	57.8 (Filtered BC)	67.2	+9.4
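
The Δ column is a plain difference in percentage points, which can be checked directly. The baseline labels follow the figures quoted under Evaluation Highlights.

```python
# Success rates in percent, as reported in the summary.
digirl_sr = 67.2
baseline_sr = {
    "SFT": 17.7,
    "CogAgent": 38.5,
    "AppAgent (GPT-4V)": 8.3,
    "Filtered BC": 57.8,
}

# Absolute gain in percentage points, rounded to one decimal to absorb float noise.
deltas = {name: round(digirl_sr - sr, 1) for name, sr in baseline_sr.items()}
```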

Experiment Figures

Performance comparison over time between a frozen policy and a continuously updating policy.

Main Takeaways

Autonomous RL (DigiRL) significantly outperforms Supervised Fine-Tuning (SFT) and Behavior Cloning by effectively learning from online interactions and failures.
The curriculum mechanism is crucial: prioritizing tasks based on learned difficulty (instruction-level value) extracts maximal learning signal.
Small models (1.3B) trained with specialized RL can outperform massive proprietary models (GPT-4V) and large specialist models (CogAgent 18B) on device control tasks.
Handling stochasticity via doubly-robust advantage estimation is essential for real-world GUI environments.
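
A doubly-robust style advantage can be sketched as a blend of the Monte-Carlo outcome and a one-step value difference. The specific mixing weight used here, `lam ** (horizon - h)`, and the default `lam=0.5` are assumptions made for illustration; the summary notes that the actual lambda value is not reported, and `step_advantage` is a hypothetical name.

```python
def step_advantage(
    h: int,
    horizon: int,
    final_reward: float,
    v_curr: float,
    v_next: float,
    step_reward: float = 0.0,
    lam: float = 0.5,
) -> float:
    """Blend a Monte-Carlo term with a one-step temporal-difference term.

    Assumption: the Monte-Carlo weight is lam ** (horizon - h), so steps near
    the end of the episode trust the observed outcome more, and early steps
    lean on the learned value function.
    """
    mc_weight = lam ** (horizon - h)
    td_term = step_reward + v_next - v_curr
    return mc_weight * final_reward + (1.0 - mc_weight) * td_term


# Final step: the estimate reduces to the observed trajectory reward.
adv_last = step_advantage(h=10, horizon=10, final_reward=1.0, v_curr=0.7, v_next=0.0)

# Two steps before the end: weight 0.25 on the outcome, 0.75 on the value difference.
adv_mid = step_advantage(h=8, horizon=10, final_reward=1.0, v_curr=0.4, v_next=0.6)
```

The first term carries high variance in a stochastic environment, while the second inherits any error in the value function; mixing the two trades one source of error against the other.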

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals (MDPs, value functions, policy gradients)
Vision-Language Models (VLMs)
Android/GUI operating environments

Key Terms

VLM: Vision-Language Model—a model that processes both image and text inputs to generate text or actions

AWR: Advantage-Weighted Regression—an off-policy RL algorithm that updates the policy by regressing on actions with high estimated advantages

AitW: Android-in-the-Wild—a dataset and benchmark for Android device control tasks

Doubly-Robust Estimator: A statistical technique for estimating advantages that combines Monte-Carlo returns (low bias, high variance) with value function estimates (high bias, low variance) to reduce overall error

Instruction-level Value Function: A learned function that predicts the expected success rate of a specific instruction, used to prioritize harder or more informative tasks (curriculum learning)

Step-level Value Function: A learned function that predicts the expected future reward from a specific state, used to compute advantages for specific actions

SFT: Supervised Fine-Tuning—training a model on a fixed dataset of expert demonstrations

GUI: Graphical User Interface—the visual interface of a device (icons, buttons, windows) that the agent interacts with

Filtered Behavior Cloning: An imitation learning approach where the agent clones only the successful trajectories from its own past experiences

CogAgent: A large vision-language model specifically designed for GUI agents

AppAgent: A prior agent framework that uses LLMs/VLMs (like GPT-4V) to control apps via simplified action spaces

Gemini 1.5 Pro: A large proprietary multimodal model by Google

Auto-Curriculum: A mechanism to automatically select which tasks to train on based on their estimated learning value or difficulty