GUI-G2: Gaussian Reward Modeling for GUI Grounding

📝 Paper Summary

GUI Grounding, GUI Agents, Reinforcement Learning

GUI-G2 improves GUI agent training by replacing sparse binary rewards with dense, continuous Gaussian rewards that model human clicking behavior and scale with element size.

Core Problem

Current RL-based GUI agents use binary hit-or-miss rewards that provide no learning signal for near-misses and ignore the continuous spatial nature of human interaction.

Why it matters:

Binary rewards create sparse feedback, making optimization inefficient, especially during early training when models rarely hit exact targets
Treating elements as dimensionless points contradicts Fitts' Law, which ties pointing behavior to target size, and ignores that human clicks follow continuous probability distributions around a target
Existing methods struggle to generalize to unseen layouts or elements with varying scales due to rigid, discrete success criteria

Concrete Example: If the target button spans [100, 100] to [150, 150], a predicted click at [151, 151] (one pixel outside the box) receives a reward of 0, exactly the same as a prediction at [1000, 1000]. The reward therefore gives the model no indication of which miss was closer or in which direction to move.
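The example above can be reproduced with a few lines of Python. This is a minimal sketch of a hit-or-miss reward, not code from the paper; the function name `binary_reward` and the inclusive box boundaries are assumptions made for illustration.

```python
def binary_reward(point, box):
    """Hypothetical hit-or-miss reward: 1.0 inside the box, 0.0 otherwise.

    point: (x, y) predicted click position
    box:   (x1, y1, x2, y2) ground-truth element, boundaries treated as inclusive
    """
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0


target = (100, 100, 150, 150)

near_miss = binary_reward((151, 151), target)    # one pixel outside the box
far_miss = binary_reward((1000, 1000), target)   # nowhere near the box
hit = binary_reward((125, 125), target)          # element center

# The near miss and the far miss are indistinguishable to the learner.
print(near_miss, far_miss, hit)  # 0.0 0.0 1.0
```

Both misses map to the same value, so a policy that improves from the far miss to the near miss sees no change in reward.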

Key Novelty

Gaussian Continuous Reward Modeling

Models target GUI elements as 2D Gaussian distributions rather than binary bounding boxes, providing a smooth reward that decays exponentially with distance from the target center
Incorporates an Adaptive Variance mechanism that scales the reward distribution's spread based on element dimensions, allowing larger spatial tolerance for big elements while demanding precision for small icons

Architecture

Conceptual diagram of the GUI-G2 reward framework, contrasting binary rewards with the proposed Gaussian formulation.

Evaluation Highlights

Achieves 47.5% accuracy on ScreenSpot-Pro, outperforming state-of-the-art UI-TARS-72B by 24.7 percentage points
Reaches 92.0% accuracy on ScreenSpot benchmark, a 4.1% improvement over baselines
Attains 93.3% accuracy on ScreenSpot-v2, demonstrating robustness across diverse interface layouts

Breakthrough Assessment

8/10

Proposes a principled, human-behavior-aligned reward formulation (Gaussian) that directly addresses the sparsity problem in GUI RL, yielding a large gain (+24.7 percentage points over UI-TARS-72B) on the difficult ScreenSpot-Pro benchmark.

⚙️ Technical Details

Problem Definition

Setting: GUI Grounding via Reinforcement Learning

Inputs: Screenshot s and natural language instruction i

Outputs: Predicted bounding box coordinates b_p = [x1, y1, x2, y2]

Pipeline Flow

Input Processing (Instruction + Screenshot)
Policy Network (VLM Inference)
Reward Calculation (Gaussian Point + Coverage)

System Modules

Policy Network

Predict bounding box coordinates from multimodal input

Model or implementation: Multimodal LLM (e.g., UI-TARS or similar base)

Reward Engine

Compute dense continuous rewards for RL updates

Model or implementation: Gaussian Function + Bhattacharyya Coefficient

Modeling

Base Model: Not explicitly specified in the snippet (likely UI-TARS or Qwen-VL based, given the context)

Training Method: Group Relative Policy Optimization (GRPO)
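To make the link between GRPO and reward density concrete, here is a minimal sketch of the group-relative advantage step: rewards for a group of sampled outputs are normalized by the group mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions for illustration; the summary does not give the paper's exact variant, and the full GRPO objective (clipping, KL term) is omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled outputs (hypothetical helper).

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# With binary rewards, a group in which every sample misses has zero variance,
# so every advantage is zero and the group contributes no update.
binary_group = [0.0, 0.0, 0.0, 0.0]
binary_adv = group_relative_advantages(binary_group)

# With dense rewards, the same four misses are ranked by how close they came.
dense_group = [0.10, 0.40, 0.70, 0.20]
dense_adv = group_relative_advantages(dense_group)

print(binary_adv)
print(dense_adv)
```

The all-miss group is the case that dominates early training, which is why a dense reward is expected to help most there.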

Objective Functions:

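Purpose: Reward precise localization of the element center.

Formally: R_point = exp(-(mu_p - mu_gt)^T Sigma^-1 (mu_p - mu_gt)), where mu_p and mu_gt are the predicted and ground-truth centers and Sigma is the adaptive covariance (diagonal, with entries sigma_x^2 and sigma_y^2, since the X and Y axes are modeled independently).

Purpose: Reward spatial overlap between predicted and target regions.

Formally: R_cover = Bhattacharyya_Coefficient(Gaussian_p, Gaussian_gt), where Gaussian_p and Gaussian_gt are the Gaussians fitted to the predicted and ground-truth boxes.

Purpose: Scale reward variance based on element size.

Formally: sigma_x = alpha * (x2 - x1), sigma_y = alpha * (y2 - y1), where [x1, y1, x2, y2] are the element's box coordinates.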
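The three formulas above can be sketched together in plain Python. This is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the covariance in R_point is taken from the ground-truth box, the predicted Gaussian uses the same alpha, boxes are assumed to have positive width and height, and `alpha = 0.5` is a placeholder because the summary does not report the value. The exponent follows the formula exactly as written in this summary.

```python
import math


def center_and_sigma(box, alpha):
    """Center and adaptive std-devs of a box: sigma = alpha * side length."""
    x1, y1, x2, y2 = box
    center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    sigma = (alpha * (x2 - x1), alpha * (y2 - y1))
    return center, sigma


def point_reward(pred_box, gt_box, alpha):
    """R_point = exp(-(mu_p - mu_gt)^T Sigma^-1 (mu_p - mu_gt)), Sigma diagonal."""
    (px, py), _ = center_and_sigma(pred_box, alpha)
    (gx, gy), (sx, sy) = center_and_sigma(gt_box, alpha)  # Sigma from GT box (assumption)
    quad = ((px - gx) / sx) ** 2 + ((py - gy) / sy) ** 2
    return math.exp(-quad)


def bhattacharyya_1d(mu1, s1, mu2, s2):
    """Bhattacharyya coefficient between two 1D Gaussians (closed form)."""
    var_sum = s1 * s1 + s2 * s2
    scale = math.sqrt(2.0 * s1 * s2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))


def coverage_reward(pred_box, gt_box, alpha):
    """R_cover: with independent axes, the 2D coefficient is a product of 1D ones."""
    (px, py), (psx, psy) = center_and_sigma(pred_box, alpha)
    (gx, gy), (gsx, gsy) = center_and_sigma(gt_box, alpha)
    return bhattacharyya_1d(px, psx, gx, gsx) * bhattacharyya_1d(py, psy, gy, gsy)


alpha = 0.5  # placeholder value, not taken from the paper
small_gt = (100, 100, 150, 150)

exact = point_reward(small_gt, small_gt, alpha)
near = point_reward((126, 126, 176, 176), small_gt, alpha)   # center at (151, 151)
far = point_reward((975, 975, 1025, 1025), small_gt, alpha)  # center at (1000, 1000)

# Same 26-pixel center offset, but on a 400x400 element: tolerance scales with size.
large_gt = (100, 100, 500, 500)
near_large = point_reward((126, 126, 526, 526), large_gt, alpha)

print(exact, near, far, near_large)
print(coverage_reward(small_gt, small_gt, alpha))
```

Unlike the binary reward, the near miss now scores strictly higher than the far miss, and the same pixel offset is penalized less on the larger element.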

Key Hyperparameters:

alpha: Scaling factor for adaptive variance (value not in snippet)
nu: Weight for point reward (value not in snippet)
gamma: Weight for coverage reward (value not in snippet)
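One plausible reading of these hyperparameters is that nu and gamma weight a sum of the two reward terms. The sketch below assumes that weighted-sum form, which this summary implies but does not state explicitly; `RewardConfig`, `total_reward`, and all default values are hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    """Hypothetical container for the three reward hyperparameters."""
    alpha: float = 0.5  # adaptive-variance scale (placeholder value)
    nu: float = 1.0     # weight on the point reward (placeholder value)
    gamma: float = 1.0  # weight on the coverage reward (placeholder value)


def total_reward(r_point, r_cover, cfg):
    """Assumed combination: nu * R_point + gamma * R_cover."""
    return cfg.nu * r_point + cfg.gamma * r_cover


cfg = RewardConfig()
print(total_reward(0.5, 0.25, cfg))  # 0.75

# Emphasizing center precision over region overlap:
point_heavy = RewardConfig(nu=2.0, gamma=0.5)
print(total_reward(0.5, 0.25, point_heavy))  # 1.125
```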

Compute: Not reported in the paper

Comparison to Prior Work

vs. UI-TARS-72B: Uses RL with continuous rewards instead of just SFT, achieving significantly higher accuracy on complex layouts
vs. GUI-R1: Replaces discrete hit-or-miss rewards with continuous Gaussian feedback, providing denser gradients for optimization

Limitations

Relies on the assumption that element interactions are well-modeled by independent X/Y Gaussian distributions
Computational cost of RL training is typically higher than supervised fine-tuning
Effectiveness depends on the quality of the underlying VLM backbone

Reproducibility

No replication artifacts mentioned in the paper. The paper relies on public benchmarks (ScreenSpot series) but does not provide code or model weights in the text.

📊 Experiments & Results

Evaluation Setup

GUI Grounding (Mapping instructions to bounding boxes) on mobile/desktop interfaces

Benchmarks:

ScreenSpot: GUI grounding (general)
ScreenSpot-v2: GUI grounding (expanded)
ScreenSpot-Pro: GUI grounding (complex/diverse)

Metrics:

Accuracy (percentage of predictions with center inside ground truth box)
Statistical methodology: Not explicitly reported in the paper
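The accuracy metric described above can be sketched as follows: a prediction counts as correct when the center of the predicted box falls inside the ground-truth box. The function names and the inclusive boundary handling are assumptions for illustration; the benchmarks' official evaluation scripts may differ in details.

```python
def center_in_box(pred_box, gt_box):
    """True if the center of pred_box lies inside gt_box (boundaries inclusive)."""
    cx = (pred_box[0] + pred_box[2]) / 2.0
    cy = (pred_box[1] + pred_box[3]) / 2.0
    x1, y1, x2, y2 = gt_box
    return x1 <= cx <= x2 and y1 <= cy <= y2


def grounding_accuracy(pred_boxes, gt_boxes):
    """Percentage of predictions whose center falls inside the ground-truth box."""
    if len(pred_boxes) != len(gt_boxes):
        raise ValueError("pred_boxes and gt_boxes must have the same length")
    if not gt_boxes:
        raise ValueError("at least one example is required")
    hits = sum(center_in_box(p, g) for p, g in zip(pred_boxes, gt_boxes))
    return 100.0 * hits / len(gt_boxes)


preds = [(110, 110, 140, 140), (126, 126, 176, 176)]
gts = [(100, 100, 150, 150), (100, 100, 150, 150)]

# First center is (125, 125): a hit. Second center is (151, 151): a miss.
print(grounding_accuracy(preds, gts))  # 50.0
```

Note that the evaluation metric stays binary even though the training reward is continuous.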

Main Takeaways

GUI-G2 achieves substantial improvements over SOTA (UI-TARS-72B), with a gain of 24.7 percentage points in accuracy on the difficult ScreenSpot-Pro benchmark.
Continuous Gaussian rewards provide superior robustness to interface variations compared to binary rewards, as evidenced by consistent gains across all three ScreenSpot versions.
The adaptive variance mechanism is critical for handling the diverse scales of GUI elements, from small icons to large panels.
Modeling interactions as continuous probability distributions aligns better with human behavior (Fitts' Law) than discrete classification approaches.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Optimization)
Gaussian Distributions
GUI Grounding concepts (Bounding Boxes, IoU)

Key Terms

GUI Grounding: The task of mapping a natural language command (e.g., 'click the search bar') to specific pixel coordinates on a screen

GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that samples a group of outputs for the same input and computes each output's advantage by normalizing its reward against the rest of the group, which stabilizes training

Fitts' Law: A model of human movement stating that the time to acquire a target is a function of the distance to and size of the target; the paper uses it to motivate treating click positions as continuous probability distributions rather than single points

Bhattacharyya coefficient: A statistical measure used to quantify the similarity or overlap between two probability distributions (used here for coverage rewards)

Binary reward: A sparse reward signal that is 1 if the action is successful (inside the box) and 0 otherwise, offering no partial credit

Dense reward: A continuous reward signal that provides feedback on *how close* an action was to success, guiding the model even when it misses