From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation

📝 Paper Summary

GUI Agents, Reinforcement Learning with Verifiable Rewards (RLVR), Imitation Learning

BEPA improves end-to-end GUI agents by re-rolling expert plans to make them policy-reachable and injecting them into reinforcement learning only when on-policy exploration fails.

Core Problem

Training end-to-end GUI agents using expert traces from framework-based systems fails due to structural mismatch (different action spaces) and distribution shift (expert trajectories lie off the student's manifold).

Why it matters:

High-quality interactive GUI environments (like OSWorld) are scarce and hard to scale, limiting on-policy exploration data
Naively mixing off-policy expert data into on-policy algorithms like GRPO causes optimization instability and exploration collapse due to covariate shift
End-to-end policies lag significantly behind framework-based agents on complex benchmarks, limiting their deployment utility

Concrete Example: An expert framework agent might solve a task using precise API calls or high-level tool usage. If a naive end-to-end agent tries to imitate this trace directly, it fails because it operates on raw screenshots and low-level mouse clicks, creating a 'structural mismatch' where the expert's path is unintelligible or unreachable for the student.

Key Novelty

Bi-Level Expert-to-Policy Assimilation (BEPA)

Level 1 (Self-Rolled Execution): Converts abstract expert plans into 'reachable' trajectories by forcing the base policy to execute the plan itself, discarding failures and keeping only traces the student can actually perform
Level 2 (Dynamic Assimilation): Integrates these traces into GRPO (Group Relative Policy Optimization) using a dynamic cache that updates with the student's own emerging successes, injecting guidance only when the student completely fails a task
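The Level 1 filter described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Trajectory`, `self_roll`, `build_seed_cache`, the `rollout_with_plan` callable, and the `attempts` count are all hypothetical names and values, and `toy_rollout` is a stand-in for running the base policy in a real GUI environment with a verifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Trajectory:
    task_id: str
    actions: List[str]
    reward: float  # 1.0 if the verifier accepts the final state, else 0.0


def self_roll(
    task_id: str,
    plan: str,
    rollout_with_plan: Callable[[str, str], Trajectory],
    attempts: int = 3,  # assumed retry budget; not given in the source
) -> Optional[Trajectory]:
    """Let the policy execute the plan itself; return the first verified success."""
    for _ in range(attempts):
        traj = rollout_with_plan(task_id, plan)
        if traj.reward > 0:
            return traj
    return None  # plan was not reachable for this policy, so it is discarded


def build_seed_cache(
    plans: Dict[str, str],
    rollout_with_plan: Callable[[str, str], Trajectory],
) -> Dict[str, Trajectory]:
    """Keep only tasks for which the policy reproduced the plan successfully."""
    cache: Dict[str, Trajectory] = {}
    for task_id, plan in plans.items():
        traj = self_roll(task_id, plan, rollout_with_plan)
        if traj is not None:
            cache[task_id] = traj
    return cache


def toy_rollout(task_id: str, plan: str) -> Trajectory:
    # Toy stand-in: the "policy" can only follow plans phrased as clicks.
    ok = "click" in plan
    return Trajectory(task_id, [f"policy action for: {plan}"], 1.0 if ok else 0.0)


plans = {"t1": "click File, then click Save", "t2": "call save_api()"}
seed_cache = build_seed_cache(plans, toy_rollout)
print(sorted(seed_cache))
```

In this toy run only the click-based plan survives, which mirrors the idea that traces the student cannot perform are dropped before any RL training starts.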

Architecture

The BEPA framework pipeline showing the two levels of assimilation during training.

Evaluation Highlights

Achieves 32.13% success on OSWorld-Verified, improving the base UITARS1.5-7B model by +9.26 percentage points (+40.5% relative)
Nearly doubles performance on the strictly held-out test split, from 5.74% to 10.30% (about 1.8×), demonstrating generalization beyond training tasks
Outperforms standard GRPO (+8.53 points) and naive expert integration methods (SFT, mixed training) across multiple benchmarks including MMBench-GUI and Online-Mind2Web

Breakthrough Assessment

8/10

Significantly closes the gap between end-to-end and framework-based GUI agents. The bi-level assimilation strategy offers a principled way to use off-policy data in on-policy RL without destabilizing training.

⚙️ Technical Details

Problem Definition

Setting: Multi-step decision making where an agent interacts with a desktop GUI to complete natural language instructions

Inputs: Screenshot s_t and instruction x

Outputs: Textual action trace a_t (autoregressively generated)

Pipeline Flow

Vision-Language Model (Policy)

System Modules

Policy Network

Maps the current screenshot and instruction to low-level action syntax

Model or implementation: UITARS1.5-7B (based on Qwen2-VL-7B-Instruct)

Novel Architectural Elements

Dynamic Off-Policy Cache: A training-time mechanism that stores successful trajectories (initially from self-rolled expert plans, later from on-policy successes) to inject into failed RL batches
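One way to picture the cache is a per-task store that starts from self-rolled traces and is overwritten when the policy succeeds on its own. The sketch below is an assumption-laden illustration: the `DynamicCache` class, its methods, the one-slot-per-task layout, and the "latest success wins" overwrite rule are hypothetical choices, not details confirmed by the source.

```python
from typing import Dict, List, Optional, Tuple

# (actions, reward); reward is 1.0 for a verified success, 0.0 otherwise
Trajectory = Tuple[List[str], float]


class DynamicCache:
    """Hypothetical per-task store of one successful trajectory."""

    def __init__(self, seed: Dict[str, Trajectory]) -> None:
        # Seeded with self-rolled expert plans (Level 1).
        self._store: Dict[str, Trajectory] = dict(seed)

    def update(self, task_id: str, rollouts: List[Trajectory]) -> None:
        # Assumed policy: the first on-policy success in the batch replaces the entry.
        for traj in rollouts:
            if traj[1] > 0:
                self._store[task_id] = traj
                return

    def get(self, task_id: str) -> Optional[Trajectory]:
        return self._store.get(task_id)


cache = DynamicCache({"t1": (["self-rolled step"], 1.0)})
cache.update("t1", [(["failed step"], 0.0), (["on-policy step"], 1.0)])
cache.update("t2", [(["failed step"], 0.0)])  # no success, so nothing is stored
print(cache.get("t1"), cache.get("t2"))
```

Refreshing entries with the student's own successes keeps the injected guidance close to what the current policy actually produces.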

Modeling

Base Model: UITARS1.5-7B

Training Method: GRPO (Group Relative Policy Optimization) with BEPA trace replacement

Objective Functions:

Purpose: Optimize policy to maximize reward while staying close to reference.

Formally: GRPO clipped surrogate with importance ratio r_t = π_θ(a_t|s_t) / π_old(a_t|s_t) and clipped term min[r_t·A, clip(r_t, 1−ε, 1+ε)·A], where A is the group-relative advantage.

Purpose: Inject off-policy guidance.

Formally: If all on-policy rollouts fail (reward=0), replace the first failed trajectory with a cached successful trajectory τ_off.
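The two objective entries above can be illustrated with a small sketch. It assumes binary verifier rewards, mean/std normalization within the group, and `clip_eps = 0.2`; the source only implies a standard PPO clip, so that value is a guess. The function names are hypothetical, and how the importance ratio is computed for an injected off-policy trace is not covered here.

```python
import math
from typing import List, Optional, Tuple

Rollout = Tuple[str, float]  # (trajectory label, verifier reward in {0.0, 1.0})


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate term; clip_eps=0.2 is an assumed value."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def inject_if_all_failed(group: List[Rollout], cached: Optional[Rollout]) -> List[Rollout]:
    """Swap the first rollout for the cached success only if every reward is 0."""
    if cached is not None and all(reward == 0.0 for _, reward in group):
        return [cached] + group[1:]
    return group


group = [(f"on_policy_{i}", 0.0) for i in range(8)]  # group size 8, all failed
adv_before = group_advantages([r for _, r in group])
mixed = inject_if_all_failed(group, ("cached_success", 1.0))
adv_after = group_advantages([r for _, r in mixed])
print(adv_before)
print([round(a, 3) for a in adv_after])
```

With an all-failed group every normalized advantage is zero, so the group contributes no learning signal. After the swap, the cached success gets a positive advantage and the remaining failures get negative ones. A group that already contains a success is returned unchanged.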

Training Data:

150 'high-value' tasks from OSWorld where success is verifiable
115 expert trajectories from Agent S2 used for Level-1 seeding

Key Hyperparameters:

group_size_N: 8
max_episode_length: 15 steps
clip_epsilon: Standard PPO clip (implied)
plan_extractor: GPT-4o (for Level-1 initialization)

Compute: Not reported in the paper

Comparison to Prior Work

vs. LUFFY: BEPA re-rolls expert plans to ensure reachability instead of using raw traces; BEPA only injects guidance upon total group failure rather than unconditional mixing
vs. BREAD: BEPA focuses on full trajectory reachability in GUI environments rather than prefix-based branching for text
vs. Agent S2: BEPA trains a single E2E model to imitate Agent S2's success but adapts the trajectory to its own manifold

Limitations

Reliance on a verifier restricts the training set to tasks with programmatic success detection (OSWorld-Verified subset)
Requires an initial pool of expert traces (from stronger agents) to seed the process
Level-1 self-rolling step requires an external planner (GPT-4o) to abstract expert traces into plans

Reproducibility

Code: https://github.com/LEON-gittech/Verl_GUI.git

Code and data are available in the repository above. Expert data comes from Agent S2 trajectories, and GPT-4o is used for plan extraction.

📊 Experiments & Results

Evaluation Setup

Desktop computer control tasks evaluated via screenshots and DOM interactions

Benchmarks:

OSWorld-Verified (desktop OS control on Ubuntu)
MMBench-GUI (Mobile/Web GUI interactions)
Online-Mind2Web (Web navigation)

Metrics:

Success Rate (SR)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Compared against	Baseline (%)	This Paper (%)	Δ (pp)
OSWorld-Verified	Success Rate	Base UITARS1.5-7B	22.87	32.13	+9.26
OSWorld-Verified	Success Rate	Standard GRPO	23.60	32.13	+8.53
OSWorld-Verified (Held-out split)	Success Rate	Not stated in this summary	5.74	10.30	+4.56
OSWorld-Verified	Success Rate	Not stated in this summary	21.05	32.13	+11.08
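The deltas in the table are absolute differences in percentage points, while the "+40.5% relative" figure quoted in the highlights is the delta divided by the baseline. A short arithmetic check, using only the numbers from the table (the `rows` list and `check_row` helper are illustrative names):

```python
# (benchmark, baseline %, this paper %, reported delta in percentage points)
rows = [
    ("OSWorld-Verified", 22.87, 32.13, 9.26),
    ("OSWorld-Verified", 23.60, 32.13, 8.53),
    ("OSWorld-Verified (Held-out split)", 5.74, 10.30, 4.56),
    ("OSWorld-Verified", 21.05, 32.13, 11.08),
]


def check_row(baseline: float, ours: float, reported_delta: float):
    """Return (delta matches the table, absolute delta, relative gain in %)."""
    delta = ours - baseline
    relative = round(100.0 * delta / baseline, 1)
    matches = abs(delta - reported_delta) < 0.005
    return matches, round(delta, 2), relative


results = [check_row(b, o, d) for _, b, o, d in rows]
for (name, b, o, _), (ok, delta, rel) in zip(rows, results):
    print(f"{name}: {b} -> {o}, delta {delta:+.2f} pp, relative {rel:+.1f}%, matches: {ok}")

held_out_ratio = 10.30 / 5.74  # multiplicative gain on the held-out split
print(round(held_out_ratio, 2))
```

All four reported deltas are consistent with their baseline and final values, and the first row reproduces the +40.5% relative gain.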

Experiment Figures

t-SNE visualization of trajectory embeddings comparing the Base Policy, Expert traces, and BEPA's self-rolled traces.

Main Takeaways

Naive integration of expert traces (SFT or simple mixing) can degrade performance in GUI agents due to severe distribution shift.
Self-rolling expert plans (Level-1) is critical to create 'reachable' training data that sits on the student policy's manifold.
Conditional injection (Level-2)—only using expert data when exploration fails—prevents the expert data from dominating the gradient and allows the agent to learn from its own successes.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO) concepts
Vision-Language Models (VLMs)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—RL where success is determined by a deterministic verifier (e.g., checking if a file exists)

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input, removing the need for a separate value function critic

GUI Agent: An AI agent that interacts with a computer via the Graphical User Interface (mouse, keyboard, screenshots) rather than APIs

End-to-End (E2E): A single model that maps inputs (pixels/text) directly to actions, without intermediate planners or specialized tools

Framework-based Agent: A system composed of multiple modules (planner, executor, tool user) to solve tasks, often more capable but complex

Self-rolling: The process of letting the policy execute a plan itself to generate a trajectory that is guaranteed to be within its own reachable state space

Covariate Shift: The difference in distribution between the training data (expert traces) and the data the model generates during its own operation