
GUI-G2: Gaussian Reward Modeling for GUI Grounding

Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Zhejiang University
arXiv.org (2025)
MM RL Agent Benchmark

📝 Paper Summary

GUI Grounding · GUI Agents · Reinforcement Learning
GUI-G2 improves GUI agent training by replacing sparse binary rewards with dense, continuous Gaussian rewards that model human clicking behavior and scale with element size.
Core Problem
Current RL-based GUI agents use binary hit-or-miss rewards that provide no gradient signal for near-misses and ignore the continuous spatial nature of human interaction.
Why it matters:
  • Binary rewards create sparse feedback, making optimization inefficient, especially during early training when models rarely hit exact targets
  • Treating elements as dimensionless points contradicts Fitts' Law, which shows human clicking naturally follows continuous probability distributions
  • Existing methods struggle to generalize to unseen layouts or elements with varying scales due to rigid, discrete success criteria
Concrete Example: Suppose the target button's bounding box spans [100, 100] to [150, 150]. A prediction at [151, 151], one pixel outside the box, receives a reward of 0, the same as a prediction at [1000, 1000], so the model gets no directional guidance.
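The sparsity described in this example can be reproduced with a minimal sketch of a hit-or-miss reward. The function name `binary_reward` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration, not the paper's code.

```python
def binary_reward(pred, box):
    """Hit-or-miss reward: 1.0 if the predicted point lies inside the box, else 0.0.

    pred is an (x, y) point; box is (x1, y1, x2, y2). Both layouts are
    illustrative assumptions, not taken from the paper.
    """
    x, y = pred
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


box = (100, 100, 150, 150)
hit = binary_reward((125, 125), box)         # inside the box
near_miss = binary_reward((151, 151), box)   # one pixel outside
far_miss = binary_reward((1000, 1000), box)  # far away from the target
```

Here `near_miss` and `far_miss` are both 0.0, so the reward alone cannot tell the model which of the two predictions was closer to the target.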
Key Novelty
Gaussian Continuous Reward Modeling
  • Models target GUI elements as 2D Gaussian distributions rather than binary bounding boxes, so predictions near the target receive a smooth reward that decays continuously with distance from the element's center
  • Incorporates an Adaptive Variance mechanism that scales the reward distribution's spread based on element dimensions, allowing larger spatial tolerance for big elements while demanding precision for small icons
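The two ideas above can be sketched as a Gaussian reward centered on the element, with a standard deviation proportional to the element's width and height. This is a rough illustration under stated assumptions, not the authors' implementation: the function name `gaussian_point_reward`, the box layout, and the scaling factor `alpha` (with its default of 0.5) are hypothetical choices made here.

```python
import math


def gaussian_point_reward(pred, box, alpha=0.5):
    """Gaussian reward centered on the box, with size-dependent spread.

    pred is an (x, y) point; box is (x1, y1, x2, y2) with positive width
    and height. alpha is a hypothetical scaling factor that ties the
    standard deviation to the element's dimensions.
    """
    x, y = pred
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Adaptive variance: larger elements get a wider, more tolerant Gaussian.
    sigma_x = alpha * (x2 - x1)
    sigma_y = alpha * (y2 - y1)
    z = ((x - cx) / sigma_x) ** 2 + ((y - cy) / sigma_y) ** 2
    return math.exp(-0.5 * z)


small_box = (100, 100, 150, 150)
center = gaussian_point_reward((125, 125), small_box)
near_miss = gaussian_point_reward((151, 151), small_box)
far_miss = gaussian_point_reward((1000, 1000), small_box)

# Same 26-pixel offset from the center, but on a 200x200 element.
large_box = (100, 100, 300, 300)
same_offset_large = gaussian_point_reward((226, 226), large_box)
```

Under these assumptions, the near-miss at (151, 151) earns a clearly positive reward while the far miss earns essentially zero, and the same pixel offset is penalized less on the larger element than on the smaller one.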
Architecture
Figure 3: Conceptual diagram of the GUI-G2 reward framework, contrasting binary rewards with the proposed Gaussian formulation.
Evaluation Highlights
  • Achieves 47.5% accuracy on ScreenSpot-Pro, a 24.7% improvement over the state-of-the-art UI-TARS-72B
  • Reaches 92.0% accuracy on ScreenSpot benchmark, a 4.1% improvement over baselines
  • Attains 93.3% accuracy on ScreenSpot-v2, demonstrating robustness across diverse interface layouts
Breakthrough Assessment
8/10
Proposes a principled, human-behavior-aligned reward formulation (Gaussian) that fundamentally addresses the sparsity problem in GUI RL, yielding massive gains (+24.7%) on difficult benchmarks.