GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

📝 Paper Summary

Native GUI Agents
Vision-Language Model Post-training
Reinforcement Learning from Verifiable Rewards (RLVR)

GUI-Libra improves native GUI agents by balancing reasoning and grounding via action-weighted supervision and stabilizing reinforcement learning against ambiguous rewards using conservative regularization.

Core Problem

Standard post-training fails for GUI agents because long reasoning traces (CoT) degrade visual grounding accuracy, and step-wise RL suffers from partial verifiability where valid actions are penalized if they don't match the specific demonstration.

Why it matters:

Open-source native agents lag behind proprietary systems in long-horizon tasks requiring both high-level planning and pixel-perfect execution
Current RLVR methods (successful in math) fail in GUIs because 'correctness' is ambiguous—many paths lead to the same goal, but datasets typically verify only one
Implicit trade-off: models trained to reason extensively often lose the ability to output precise coordinates (grounding)

Concrete Example: In a navigation task, both clicking a 'Search' icon and typing in a 'Menu' bar might be valid next steps. Because the offline dataset only records the 'Search' click, a standard RL agent that chooses the 'Menu' bar receives a negative reward (failure), confusing the policy with false negative signals.
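
The false negative in this example can be reproduced with a few lines of Python. This is a minimal sketch, not the paper's reward implementation: the `Action` class, the element labels standing in for screen coordinates, the 1.0/0.0 reward values, and the `valid_alternatives` set (which an offline dataset would not actually contain) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    op: str      # e.g. "click" or "type"
    target: str  # hypothetical element label standing in for coordinates


def stepwise_reward(predicted: Action, demonstrated: Action) -> float:
    """Exact-match reward against the single recorded demonstration."""
    return 1.0 if predicted == demonstrated else 0.0


# The offline dataset records only one of the valid next steps.
demo = Action("click", "Search")

# Ground truth that the dataset does NOT contain (assumed here for illustration).
valid_alternatives = {Action("click", "Search"), Action("type", "Menu")}

choice = Action("type", "Menu")
reward = stepwise_reward(choice, demo)

# A valid action that still receives the failure reward is a false negative.
is_false_negative = reward == 0.0 and choice in valid_alternatives
print(reward, is_false_negative)
```

The verifier can confirm a match with the demonstration but cannot confirm that a mismatch is wrong, which is the sense in which the reward is only partially verifiable.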

Key Novelty

Action-Aware Supervision & Conservative Partial-Verify RL

Action-Aware SFT (ASFT): Explicitly reweights loss functions to prioritize action and coordinate tokens over reasoning tokens, preventing 'thought' generation from overwhelming execution capability
Conservative RL: Reintroduces KL regularization (contrary to recent RLVR trends) to prevent policy drift under ambiguous rewards
Success-Adaptive Negative Scaling (SANS): Downweights gradients from 'negative' samples in RL, because under partial verifiability some of those samples were actually valid or ambiguous actions; this reduces the impact of false negatives

Architecture

The data construction and training pipeline. Shows the flow from raw open-source data -> cleaning/filtering -> GUI-Libra-81K -> Action-Aware SFT -> Partially Verifiable RL.

Evaluation Highlights

+15.6 percentage-point success rate improvement on AndroidWorld (26.8 to 42.4) for GUI-Libra-4B over its base model (Qwen2-VL-2B)
+12.2 percentage-point success rate improvement on AndroidWorld for GUI-Libra-8B over its base model (Qwen2-VL-7B)
+8.7 percentage-point success rate improvement on Online-Mind2Web for GUI-Libra-8B, narrowing the gap with closed-source systems

Breakthrough Assessment

8/10

Strong empirical results on challenging online benchmarks (AndroidWorld) and a thoughtful methodological correction to how RLVR is applied to GUIs (handling partial verifiability). The release of a curated 81K dataset is a significant resource contribution.

⚙️ Technical Details

Problem Definition

Setting: Goal-conditioned partially observable Markov decision process (POMDP) for GUI interaction

Inputs: Instruction ℓ, Interaction history h_t, Current observation (screenshot) o_t

Outputs: Action a_t (operation + arguments/coordinates)

Pipeline Flow

Input Processing: Instruction + History + Screenshot
Native VLM: Encodes vision and text -> Generates Reasoning Trace -> Generates Action
Action Execution: Action executed on Android/Web environment
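
The three pipeline stages can be sketched as a simple interaction loop. This is an illustrative outline under stated assumptions, not the authors' code: the `Action` and `AgentInput` types, the representation of history as a list of past actions, and the `stub_policy` and `stub_env` functions are all hypothetical stand-ins for the VLM and the Android/Web environment.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    operation: str                                 # e.g. "click", "type"
    arguments: dict = field(default_factory=dict)  # e.g. {"x": 120, "y": 48}


@dataclass
class AgentInput:
    instruction: str       # the instruction
    history: List[Action]  # h_t, assumed here to be the list of past actions
    screenshot: bytes      # o_t, the current observation


Policy = Callable[[AgentInput], Tuple[str, Action]]  # -> (reasoning trace, action)
Env = Callable[[Action], Tuple[bytes, bool]]         # -> (next screenshot, done)


def run_episode(instruction: str, first_screenshot: bytes,
                policy: Policy, env: Env, max_steps: int = 10) -> List[Action]:
    """Input processing -> policy (reasoning + action) -> action execution."""
    history: List[Action] = []
    screenshot = first_screenshot
    for _ in range(max_steps):
        _reasoning, action = policy(AgentInput(instruction, list(history), screenshot))
        history.append(action)
        screenshot, done = env(action)
        if done:
            break
    return history


def stub_policy(inp: AgentInput) -> Tuple[str, Action]:
    # Hypothetical stand-in for the native VLM.
    if not inp.history:
        return "Open search first.", Action("click", {"x": 120, "y": 48})
    return "Enter the query.", Action("type", {"text": "weather"})


def stub_env(action: Action) -> Tuple[bytes, bool]:
    # Hypothetical environment: the episode ends once text has been typed.
    return b"", action.operation == "type"


trace = run_episode("Search for weather", b"", stub_policy, stub_env)
print([a.operation for a in trace])
```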

System Modules

Native VLM

Jointly processes visual and textual context to output thoughts and executable actions

Model or implementation: Qwen2-VL (2B or 7B variants)

Novel Architectural Elements

Integrated Action-Aware Loss: During SFT, the loss reweights each token according to whether it belongs to the 'reasoning' or the 'action' segment of the output. This is a change to the training objective rather than to the network architecture.
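
A token-weighted negative log-likelihood of this kind might look as follows. This is a sketch under assumptions: the weight values (`w_action=1.0`, `w_reason=0.2`) and the normalization by the sum of weights are illustrative choices, not values or details taken from the paper.

```python
import math
from typing import Sequence


def action_aware_nll(token_logprobs: Sequence[float],
                     is_action_token: Sequence[bool],
                     w_action: float = 1.0,
                     w_reason: float = 0.2) -> float:
    """Weighted NLL: action tokens get w_action, reasoning tokens get w_reason.

    The weight values and the weight-sum normalization are assumptions.
    """
    if len(token_logprobs) != len(is_action_token):
        raise ValueError("token_logprobs and is_action_token must align")
    weights = [w_action if is_act else w_reason for is_act in is_action_token]
    if sum(weights) <= 0:
        raise ValueError("at least one token must carry positive weight")
    weighted_nll = sum(w * -lp for w, lp in zip(weights, token_logprobs))
    return weighted_nll / sum(weights)


# Four well-predicted reasoning tokens followed by two poorly predicted action tokens.
logprobs = [math.log(0.5)] * 4 + [math.log(0.25)] * 2
mask = [False] * 4 + [True] * 2

uniform_loss = action_aware_nll(logprobs, mask, w_action=1.0, w_reason=1.0)
asft_loss = action_aware_nll(logprobs, mask)
print(uniform_loss, asft_loss)
```

With equal weights the function reduces to the ordinary mean NLL. With a lower reasoning weight, errors on the action tokens make up a larger share of the loss, so a long reasoning trace cannot dilute them.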

Modeling

Base Model: Qwen2-VL-2B and Qwen2-VL-7B

Training Method: Action-aware Supervised Fine-Tuning (ASFT) followed by Conservative RL (GRPO + KL)

Objective Functions:

Purpose: SFT Loss.

Formally: Minimize negative log-likelihood with token-wise weights w_t, where w_t is higher for action tokens and lower for reasoning tokens.

Purpose: RL Policy Update (GRPO).

Formally: Maximize group-relative advantage A_k with KL penalty term β * KL(π || π_ref).

Purpose: Success-Adaptive Negative Scaling (SANS).

Formally: Scale negative advantages by factor λ < 1 if the trajectory outcome is ambiguous/noisy.
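
The RL objectives above can be sketched numerically. This is a simplified illustration, not the paper's implementation: it omits GRPO's importance ratio and clipping, estimates the KL term as the sample mean of log π minus log π_ref, and takes λ as a fixed input because the summary does not specify how it is adapted. All function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def scale_negative_advantages(advantages: Sequence[float], lam: float) -> List[float]:
    """Shrink only the negative advantages by lam; positive ones are unchanged."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [a * lam if a < 0 else a for a in advantages]


def kl_penalized_objective(logp: Sequence[float], logp_ref: Sequence[float],
                           advantages: Sequence[float], beta: float) -> float:
    """Simplified surrogate: mean(A_k * logp_k) - beta * KL estimate.

    The KL estimate is a plain Monte Carlo mean of (logp - logp_ref) over
    samples drawn from the current policy (an assumption of this sketch).
    """
    surrogate = statistics.fmean(a * lp for a, lp in zip(advantages, logp))
    kl_estimate = statistics.fmean(lp - lr for lp, lr in zip(logp, logp_ref))
    return surrogate - beta * kl_estimate


rewards = [1.0, 0.0, 0.0, 1.0]  # one group of sampled actions for the same state
adv = group_relative_advantages(rewards)
scaled = scale_negative_advantages(adv, lam=0.5)

logp = [-1.0, -2.0, -1.5, -0.5]      # log-probs under the current policy
logp_ref = [-1.2, -1.8, -1.5, -0.7]  # log-probs under the reference policy
objective = kl_penalized_objective(logp, logp_ref, scaled, beta=0.1)
print(adv, scaled, objective)
```

Scaling only the negative advantages reduces how strongly the policy is pushed away from actions that failed to match the demonstration, while the KL term limits how far the policy can move from the reference in any direction.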

Training Data:

GUI-Libra-81K: Curated from 7 open-source datasets (e.g., AndroidControl, GUI-Odyssey, AMEX)
Filtered to remove incomplete traces, extreme lengths (<3 or >50 steps), and compound actions
Includes 81K high-quality GUI reasoning samples
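
The filtering rules listed above could be expressed as a single predicate. This is a sketch under assumptions: the trajectory schema (a `complete` flag and a per-step `actions` list) and the reading of "compound action" as a step with more than one action are hypothetical; only the 3 to 50 step bounds come from the summary.

```python
from typing import Dict, List


def keep_trajectory(traj: Dict, min_steps: int = 3, max_steps: int = 50) -> bool:
    """Return True if a trajectory passes all three filters."""
    steps: List[Dict] = traj["steps"]
    if not traj.get("complete", False):
        return False  # incomplete trace
    if len(steps) < min_steps or len(steps) > max_steps:
        return False  # extreme length (<3 or >50 steps)
    if any(len(step["actions"]) != 1 for step in steps):
        return False  # compound action (assumed: more than one action per step)
    return True


def single(op: str) -> Dict:
    return {"actions": [op]}


good = {"complete": True, "steps": [single("click"), single("type"), single("click")]}
too_short = {"complete": True, "steps": [single("click"), single("type")]}
incomplete = {"complete": False, "steps": [single("click")] * 5}
compound = {"complete": True,
            "steps": [single("click"), {"actions": ["click", "type"]}, single("click")]}

kept = [t for t in (good, too_short, incomplete, compound) if keep_trajectory(t)]
print(len(kept))
```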

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
kl_coefficient_beta: Non-zero (explicitly maintained, unlike recent RLVR trends)
negative_gradient_scale_lambda: Scalar < 1 (adaptive)

Compute: Not reported in the paper

Comparison to Prior Work

vs. UI-TARS: GUI-Libra explicitly uses KL regularization and partial verifiability handling, whereas typical RLVR often drops KL.
vs. Aguvis: GUI-Libra uses a rigorous data filtering pipeline and action-aware loss to prevent reasoning from hurting grounding.
vs. Recent RLVR (e.g., generic math RL): GUI-Libra retains KL regularization to handle the ambiguity of GUI rewards, whereas math RL often drops KL for pure performance.
vs. SeeClick [not cited in paper]: SeeClick focuses only on grounding (SFT); GUI-Libra is a native agent handling full reasoning and multi-step tasks via RL.

Limitations

Relies on existing open-source datasets for the initial pool; quality is bounded by the raw data sources.
Partial verifiability solution (SANS) reduces bias but does not fully solve the false negative problem in offline RL.
Evaluation focuses on navigation; content generation or creative tasks are not the primary focus.

Reproducibility

Code: https://gui-libra.github.io

The authors release the curated GUI-Libra-81K dataset, code, and trained models (GUI-Libra-4B/8B) at https://gui-libra.github.io. Specific training hyperparameters (learning rate, batch size) are not reported in the paper.

📊 Experiments & Results

Evaluation Setup

Evaluation across diverse web and mobile platforms using both offline (static dataset) and online (interactive simulator) benchmarks.

Benchmarks:

AndroidWorld (Online mobile device control)
WebArena-Lite-v2 (Online web browsing agents)
Online-Mind2Web (Online web agents)
AITW (Offline Android task navigation)
Mind2Web (Offline web task navigation)

Metrics:

Success Rate (SR)
Step-wise Action Accuracy
Grounding Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Results on the AndroidWorld benchmark showing significant gains over base models via the GUI-Libra post-training recipe.
AndroidWorld	Success Rate	26.8	42.4	+15.6
AndroidWorld	Success Rate	Not reported in the paper	Not reported in the paper	+12.2
Results on Online-Mind2Web showing improvements in web navigation tasks.
Online-Mind2Web	Success Rate	Not reported in the paper	Not reported in the paper	+4.0
Online-Mind2Web	Success Rate	Not reported in the paper	Not reported in the paper	+8.7
Results on WebArena-Lite-v2 demonstrating consistent gains across different web environments.
WebArena-Lite-v2	Success Rate	Not reported in the paper	Not reported in the paper	+12.5
WebArena-Lite-v2	Success Rate	Not reported in the paper	Not reported in the paper	+11.3

Main Takeaways

Double-digit success rate gains on AndroidWorld (+15.6, +12.2) and WebArena-Lite-v2 (+12.5, +11.3) suggest that the post-training recipe holds up across both mobile and web domains.
Conservative RL (with KL regularization) is critical for GUI agents; unlike math reasoning where dropping KL helps, GUI agents require it to maintain grounding stability under ambiguous rewards.
Careful data filtering and action-aware supervision allow smaller open-source models (4B/8B) to perform competitively without expensive online data collection.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Reinforcement Learning from Verifiable Rewards (RLVR)
Supervised Fine-Tuning (SFT)
Proximal Policy Optimization / GRPO

Key Terms

Native GUI Agents: End-to-end models that map instructions and screenshots directly to executable actions without external planners

Partial Verifiability: A characteristic of GUI tasks where multiple valid actions exist for a state, but offline data verifies only one, causing ambiguity in reward assignment

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a sampled group of outputs to reduce variance

KL Regularization: A penalty term enforcing the trained policy to stay close to a reference policy (usually the SFT model) to prevent mode collapse or drift

Action-aware SFT (ASFT): A fine-tuning strategy that assigns higher loss weights to action/grounding tokens and lower weights to reasoning tokens to preserve execution precision

Grounding: The ability of the model to map semantic intent (e.g., 'click the button') to precise screen coordinates

Chain-of-Thought (CoT): Intermediate reasoning steps generated by the model before the final action

Success-Adaptive Scaling: A technique to downweight the learning signal from negative samples in RL when the reward signal is unreliable (ambiguous)