
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, Tong Zhang
University of Illinois Urbana-Champaign, Microsoft, University of North Carolina at Chapel Hill
arXiv (2026)
MM Agent RL Reasoning

📝 Paper Summary

Native GUI Agents · Vision-Language Model Post-training · Reinforcement Learning from Verifiable Rewards (RLVR)
GUI-Libra improves native GUI agents in two ways: action-weighted supervision balances reasoning against grounding, and conservative (KL-regularized) reinforcement learning stays stable when rewards are ambiguous.
Core Problem
Standard post-training fails for GUI agents for two reasons: long chain-of-thought (CoT) reasoning traces degrade visual grounding accuracy, and step-wise RL suffers from partial verifiability, where valid actions are penalized when they do not match the single demonstrated action.
Why it matters:
  • Open-source native agents lag behind proprietary systems in long-horizon tasks requiring both high-level planning and pixel-perfect execution
  • Current RLVR methods (successful in math) fail in GUIs because 'correctness' is ambiguous: many paths lead to the same goal, but datasets typically verify only one
  • Implicit trade-off: models trained to reason extensively often lose the ability to output precise coordinates (grounding)
Concrete Example: In a navigation task, both clicking a 'Search' icon and typing in a 'Menu' bar might be valid next steps. Because the offline dataset only records the 'Search' click, a standard RL agent that chooses the 'Menu' bar receives a negative reward (failure), confusing the policy with false negative signals.
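The false-negative effect in this example can be reproduced with a toy exact-match reward. This is a minimal sketch, not the paper's reward function: the `Action` type, the `stepwise_reward` helper, the 1.0/0.0 reward values, and the `valid_alternatives` set are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical GUI action: an action type plus the UI element it targets."""
    kind: str
    target: str


def stepwise_reward(predicted: Action, demonstrated: Action) -> float:
    """Exact-match reward: 1.0 only if the prediction equals the one logged action.

    The 1.0 / 0.0 values are an assumption made for this sketch.
    """
    return 1.0 if predicted == demonstrated else 0.0


# The offline dataset records a single action for this step.
demo = Action("click", "Search")

# Ground truth the verifier cannot see: both actions would advance the task.
valid_alternatives = {Action("click", "Search"), Action("type", "Menu")}

for candidate in (demo, Action("type", "Menu"), Action("click", "Back")):
    reward = stepwise_reward(candidate, demo)
    is_false_negative = reward == 0.0 and candidate in valid_alternatives
    label = "false negative" if is_false_negative else "correctly scored"
    print(f"{candidate.kind:>5} {candidate.target:<6} reward={reward} ({label})")
```

The 'Menu' action and the genuinely wrong 'Back' action receive the same reward, so the policy cannot tell a valid alternative from a mistake.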
Key Novelty
Action-Aware Supervision & Conservative Partial-Verify RL
  • Action-Aware SFT (ASFT): Explicitly reweights loss functions to prioritize action and coordinate tokens over reasoning tokens, preventing 'thought' generation from overwhelming execution capability
  • Conservative RL: Reintroduces KL regularization (contrary to recent RLVR trends) to prevent policy drift under ambiguous rewards
  • Success-Adaptive Scaling: Downweights gradients for 'negative' samples in RL if the agent's path was actually valid or ambiguous, reducing the impact of false negatives due to partial verifiability
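The three ideas above can be sketched in a few lines of plain Python. This is an illustrative reading of the summary, not the authors' implementation: the function names, the default weights, the simple policy-gradient surrogate, and the log-ratio KL estimate are all assumptions. In particular, this summary does not say how the scale for negative samples is derived from success statistics, so `negative_scale` is left as an input.

```python
def action_weighted_nll(token_logprobs, is_action_token,
                        action_weight=2.0, reasoning_weight=1.0):
    """Weighted mean negative log-likelihood over one response.

    Tokens flagged as action/coordinate tokens get `action_weight`;
    reasoning tokens get `reasoning_weight`. Both defaults are assumed.
    """
    weights = [action_weight if flag else reasoning_weight
               for flag in is_action_token]
    weighted_sum = sum(-w * lp for w, lp in zip(weights, token_logprobs))
    return weighted_sum / sum(weights)


def scaled_advantage(advantage, negative_scale):
    """Shrink negative advantages by `negative_scale` in [0, 1].

    Positive advantages pass through unchanged, so only possibly
    false-negative samples are downweighted.
    """
    return advantage if advantage >= 0 else advantage * negative_scale


def conservative_rl_loss(logp_policy, logp_reference, advantage,
                         negative_scale=0.5, kl_coef=0.05):
    """Policy-gradient surrogate plus a KL penalty toward a reference policy.

    The KL term is estimated as the mean per-token log-ratio on the
    sampled tokens (a simple estimator chosen for this sketch).
    """
    adv = scaled_advantage(advantage, negative_scale)
    sequence_logp = sum(logp_policy)
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_reference))
    kl_estimate /= len(logp_policy)
    return -adv * sequence_logp + kl_coef * kl_estimate


# Two reasoning tokens followed by one poorly predicted coordinate token.
logprobs = [-1.0, -1.0, -3.0]
action_mask = [False, False, True]
uniform = action_weighted_nll(logprobs, action_mask, action_weight=1.0)
weighted = action_weighted_nll(logprobs, action_mask, action_weight=2.0)
print(f"uniform loss={uniform:.3f}, action-weighted loss={weighted:.3f}")
```

With uniform weights the coordinate error is diluted by the surrounding reasoning tokens; raising `action_weight` makes the same error contribute more to the loss, which is the balance the ASFT bullet describes.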
Architecture
Figure 2 (implied from text description of pipeline)
The data construction and training pipeline. Shows the flow from raw open-source data -> cleaning/filtering -> GUI-Libra-81K -> Action-Aware SFT -> Partially Verifiable RL.
Evaluation Highlights
  • +15.6% success rate improvement on AndroidWorld for GUI-Libra-4B over its base model (Qwen2-VL-2B)
  • +12.2% success rate improvement on AndroidWorld for GUI-Libra-8B over its base model (Qwen2-VL-7B)
  • +8.7% success rate improvement on Online-Mind2Web for GUI-Libra-8B, narrowing the gap with closed-source systems
Breakthrough Assessment
8/10
Strong empirical results on challenging online benchmarks (AndroidWorld) and a thoughtful methodological correction to how RLVR is applied to GUIs (handling partial verifiability). The release of a curated 81K dataset is a significant resource contribution.