GUI-G2 improves GUI agent training by replacing sparse binary rewards with dense, continuous Gaussian rewards that model human clicking behavior and scale with element size.
Core Problem
Current RL-based GUI agents use binary hit-or-miss rewards that provide no gradient signal for near-misses and ignore the continuous spatial nature of human interaction.
Why it matters:
Binary rewards create sparse feedback, making optimization inefficient, especially during early training when models rarely hit exact targets
Treating targets as dimensionless points ignores Fitts' Law, which ties target acquisition to element size and distance and suggests that human clicks spread continuously around a target rather than landing on a single pixel
Existing methods struggle to generalize to unseen layouts or elements with varying scales due to rigid, discrete success criteria
Concrete Example: If the target button spans [100, 100] to [150, 150], a prediction at [151, 151] (one pixel outside the box) receives a reward of 0, the same as a prediction at [1000, 1000], so the model gets no directional guidance.
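The example above can be reproduced with a minimal sketch of a hit-or-miss reward. The function name `binary_reward` and the inclusive box convention are assumptions made for illustration, not details taken from the paper.

```python
def binary_reward(point, box):
    """Return 1.0 if the point lies inside the box (edges inclusive), else 0.0.

    `box` is (x1, y1, x2, y2); this layout is an illustrative assumption.
    """
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


box = (100, 100, 150, 150)
near_miss = binary_reward((151, 151), box)    # one pixel outside the box
far_miss = binary_reward((1000, 1000), box)   # far from the target
hit = binary_reward((125, 125), box)          # box center

print(near_miss, far_miss, hit)  # 0.0 0.0 1.0
```

Both misses score exactly the same, which is the sparsity problem the continuous reward is meant to address.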
Key Novelty
Gaussian Continuous Reward Modeling
Models target GUI elements as 2D Gaussian distributions rather than binary bounding boxes, providing smooth, exponentially decaying gradients for predictions near the target
Incorporates an Adaptive Variance mechanism that scales the reward distribution's spread based on element dimensions, allowing larger spatial tolerance for big elements while demanding precision for small icons
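As a rough sketch of these two ideas, the snippet below scores a predicted click with an axis-aligned 2D Gaussian centered on the target box, with a standard deviation proportional to the box's width and height. The function name, the exact form `sigma = alpha * size`, and the value `alpha=0.5` used in the demo are illustrative assumptions; the summary does not state the paper's actual value.

```python
import math


def gaussian_point_reward(point, box, alpha):
    """Gaussian reward for a click, peaking at 1.0 at the box center.

    Assumes a box (x1, y1, x2, y2) with positive width and height, and
    independent X/Y components with sigma = alpha * size (an assumption).
    """
    x, y = point
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sigma_x = alpha * (x2 - x1)  # adaptive variance: wider box, wider tolerance
    sigma_y = alpha * (y2 - y1)
    z = ((x - cx) / sigma_x) ** 2 + ((y - cy) / sigma_y) ** 2
    return math.exp(-0.5 * z)


ALPHA = 0.5  # hypothetical value, for illustration only

button = (100, 100, 150, 150)
r_center = gaussian_point_reward((125, 125), button, ALPHA)
r_near = gaussian_point_reward((151, 151), button, ALPHA)
r_far = gaussian_point_reward((1000, 1000), button, ALPHA)

# Same 10-pixel offset from the center, different element sizes.
small_icon = (0, 0, 20, 20)
large_panel = (0, 0, 400, 400)
r_small = gaussian_point_reward((20, 20), small_icon, ALPHA)
r_large = gaussian_point_reward((210, 210), large_panel, ALPHA)
```

Unlike the binary case, the near miss now earns more than the far miss, and the same pixel offset is penalized less on the large panel than on the small icon.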
Architecture
Conceptual diagram of the GUI-G2 reward framework, contrasting binary rewards with the proposed Gaussian formulation.
Evaluation Highlights
Achieves 47.5% accuracy on ScreenSpot-Pro, outperforming state-of-the-art UI-TARS-72B by 24.7 percentage points
Reaches 92.0% accuracy on ScreenSpot benchmark, a 4.1% improvement over baselines
Attains 93.3% accuracy on ScreenSpot-v2, demonstrating robustness across diverse interface layouts
Breakthrough Assessment
8/10
Proposes a principled, human-behavior-aligned Gaussian reward formulation that directly addresses reward sparsity in GUI RL, yielding large gains on difficult benchmarks (+24.7 percentage points on ScreenSpot-Pro).
⚙️ Technical Details
Problem Definition
Setting: GUI Grounding via Reinforcement Learning
Inputs: Screenshot s and natural language instruction i
alpha: Scaling factor for adaptive variance (value not in snippet)
nu: Weight for point reward (value not in snippet)
gamma: Weight for coverage reward (value not in snippet)
Compute: Not reported in the paper
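One way the three hyperparameters above could fit together is sketched below: a point reward and a coverage reward, each built from Gaussians whose spread is set by alpha, combined with weights nu and gamma. The weighted-sum form, the derivation of a Gaussian from the predicted box, and all numeric values are assumptions for illustration; only the parameter names and the use of a Bhattacharyya coefficient for coverage come from the summary. The coefficient itself uses the standard closed form for two univariate Gaussians.

```python
import math


def bc_1d(mu1, s1, mu2, s2):
    """Bhattacharyya coefficient of two 1D Gaussians (closed form)."""
    var_sum = s1 * s1 + s2 * s2
    dist = 0.25 * (mu1 - mu2) ** 2 / var_sum + 0.5 * math.log(var_sum / (2 * s1 * s2))
    return math.exp(-dist)


def box_to_gaussian(box, alpha):
    """Map a box (x1, y1, x2, y2) to (mu_x, mu_y, sigma_x, sigma_y)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2, alpha * (x2 - x1), alpha * (y2 - y1)


def point_reward(pred_box, gt_box, alpha):
    """Target Gaussian evaluated at the predicted box center."""
    px, py, _, _ = box_to_gaussian(pred_box, alpha)
    gx, gy, gsx, gsy = box_to_gaussian(gt_box, alpha)
    return math.exp(-0.5 * (((px - gx) / gsx) ** 2 + ((py - gy) / gsy) ** 2))


def coverage_reward(pred_box, gt_box, alpha):
    """Overlap of predicted and target Gaussians, assuming independent X/Y."""
    px, py, psx, psy = box_to_gaussian(pred_box, alpha)
    gx, gy, gsx, gsy = box_to_gaussian(gt_box, alpha)
    return bc_1d(px, psx, gx, gsx) * bc_1d(py, psy, gy, gsy)


def total_reward(pred_box, gt_box, alpha, nu, gamma):
    """Assumed combination: weighted sum of point and coverage rewards."""
    return nu * point_reward(pred_box, gt_box, alpha) + gamma * coverage_reward(
        pred_box, gt_box, alpha
    )


# Hypothetical hyperparameter values, not taken from the paper.
ALPHA, NU, GAMMA = 0.5, 1.0, 1.0

gt = (100, 100, 150, 150)
r_exact = total_reward(gt, gt, ALPHA, NU, GAMMA)
r_shifted = total_reward((120, 120, 170, 170), gt, ALPHA, NU, GAMMA)
r_wrong_size = total_reward((75, 75, 175, 175), gt, ALPHA, NU, GAMMA)
```

In this sketch a box with the right center but the wrong size keeps the full point reward and loses only coverage reward, which is the kind of separation the two weights would control.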
Comparison to Prior Work
vs. UI-TARS-72B: Uses RL with continuous rewards rather than supervised fine-tuning (SFT) alone, achieving significantly higher accuracy on complex layouts
vs. GUI-R1: Replaces discrete hit-or-miss rewards with continuous Gaussian feedback, providing denser gradients for optimization
Limitations
Relies on the assumption that element interactions are well-modeled by independent X/Y Gaussian distributions
Computational cost of RL training is typically higher than supervised fine-tuning
Effectiveness depends on the quality of the underlying VLM backbone
Reproducibility
No replication artifacts mentioned in the paper. The paper relies on public benchmarks (ScreenSpot series) but does not provide code or model weights in the text.
📊 Experiments & Results
Evaluation Setup
GUI Grounding (Mapping instructions to bounding boxes) on mobile/desktop interfaces
Benchmarks:
ScreenSpot (general GUI grounding)
ScreenSpot-v2 (expanded GUI grounding)
ScreenSpot-Pro (complex and diverse GUI grounding)
Metrics:
Accuracy (percentage of predictions with center inside ground truth box)
Statistical methodology: Not explicitly reported in the paper
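The accuracy metric described above can be sketched as follows. The function name and the (x1, y1, x2, y2) box layout are assumptions for illustration, and the benchmarks' official evaluation scripts may differ in details such as edge handling.

```python
def grounding_accuracy(pred_boxes, gt_boxes):
    """Percentage of predictions whose center falls inside the ground-truth box."""
    if len(pred_boxes) != len(gt_boxes):
        raise ValueError("pred_boxes and gt_boxes must have the same length")
    if not gt_boxes:
        return 0.0
    hits = 0
    for (px1, py1, px2, py2), (gx1, gy1, gx2, gy2) in zip(pred_boxes, gt_boxes):
        cx, cy = (px1 + px2) / 2, (py1 + py2) / 2  # center of the predicted box
        if gx1 <= cx <= gx2 and gy1 <= cy <= gy2:
            hits += 1
    return 100.0 * hits / len(gt_boxes)


preds = [(110, 110, 140, 140), (300, 300, 340, 340)]
targets = [(100, 100, 150, 150), (0, 0, 50, 50)]
acc = grounding_accuracy(preds, targets)
print(acc)  # 50.0
```

The metric itself remains binary even when training uses a continuous reward, so the reported numbers stay comparable with prior work.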
Main Takeaways
GUI-G2 achieves substantial improvements over the previous state of the art (UI-TARS-72B), including a gain of 24.7 percentage points on the difficult ScreenSpot-Pro benchmark.
Continuous Gaussian rewards provide superior robustness to interface variations compared to binary rewards, as evidenced by consistent gains across all three ScreenSpot versions.
The adaptive variance mechanism is critical for handling the diverse scales of GUI elements, from small icons to large panels.
Modeling interactions as continuous probability distributions aligns better with human behavior (Fitts' Law) than discrete classification approaches.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning (Policy Optimization)
Gaussian Distributions
GUI Grounding concepts (Bounding Boxes, IoU)
Key Terms
GUI Grounding: The task of mapping a natural language command (e.g., 'click the search bar') to specific pixel coordinates on a screen
GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that normalizes advantages within a group of sampled outputs to stabilize training
Fitts' Law: A model of human movement stating that the time to acquire a target is a function of the distance to and size of the target, implying spatial interactions follow probability distributions
Bhattacharyya coefficient: A statistical measure used to quantify the similarity or overlap between two probability distributions (used here for coverage rewards)
Binary reward: A sparse reward signal that is 1 if the action is successful (inside the box) and 0 otherwise, offering no partial credit
Dense reward: A continuous reward signal that provides feedback on *how close* an action was to success, guiding the model even when it misses
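To connect the GRPO, binary reward, and dense reward entries, here is a small sketch of group-relative advantage normalization applied to one group of sampled predictions. The reward values are made up for illustration, and the `eps` term and the use of the population standard deviation are assumptions; actual implementations may differ.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled predictions for the same instruction, none inside the box.
binary_rewards = [0.0, 0.0, 0.0, 0.0]   # every miss looks identical
dense_rewards = [0.34, 0.05, 0.0, 0.0]  # illustrative Gaussian-style rewards

adv_binary = group_relative_advantages(binary_rewards)
adv_dense = group_relative_advantages(dense_rewards)
```

With binary rewards every advantage in the group is zero, so the update carries no learning signal. With dense rewards the closest miss gets a positive advantage and the model is pushed toward it.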