Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning

📝 Paper Summary

GUI Agents, Visual Grounding, Reinforcement Learning

SE-GUI improves how agents locate interface elements by using distance-based feedback and filtering training data based on whether the model's own attention maps focus on the correct regions.

Core Problem

Existing GUI agents rely on Supervised Fine-Tuning (SFT), which generalizes poorly to complex screens, while standard Reinforcement Learning (RL) struggles because binary success/failure rewards are too sparse for precise coordinate prediction.

Why it matters:

Sparse binary rewards (0/1) give every incorrect prediction the same zero feedback early in training, so the model cannot learn from 'near misses'
Automated grounding datasets often contain noise (e.g., hidden DOM elements), causing models to learn incorrect associations between instructions and visual locations
Current SFT methods require massive datasets, yet they still scale poorly to high-resolution professional environments, in contrast to the iterative way humans learn

Concrete Example: If a model clicks slightly to the left of a button, a standard binary reward gives it a '0', the same as a click on the opposite side of the screen. SE-GUI instead gives a partial reward based on proximity. Additionally, if the training data contains a label for a hidden button, standard methods train on that noise, whereas SE-GUI detects that the model's attention is not focused on the labeled region and drops the bad sample.
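
The contrast in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample coordinates, and the choice to normalize distance by the image diagonal are hypothetical and are not taken from the paper, which may define its Dense Point Reward differently.

```python
import math

def binary_reward(pred, box):
    """Sparse reward: 1.0 if the predicted point lies inside the box, else 0.0."""
    x, y = pred
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def dense_point_reward(pred, box, image_size):
    """Illustrative proximity reward in [0, 1]; NOT the paper's exact formula.

    Assumption: distance to the box center is normalized by the image diagonal.
    """
    x, y = pred
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = image_size
    dist = math.hypot(x - cx, y - cy) / math.hypot(w, h)
    return max(0.0, 1.0 - dist)

# Hypothetical 1920x1080 screen with one button.
screen = (1920, 1080)
button = (500, 300, 600, 340)   # (x1, y1, x2, y2)
near_miss = (490, 320)          # just left of the button
far_miss = (1800, 1000)         # opposite side of the screen

for name, point in [("near", near_miss), ("far", far_miss)]:
    print(name, binary_reward(point, button),
          round(dense_point_reward(point, button, screen), 3))
```

Both misses receive a binary reward of 0, while the proximity-based reward ranks the near miss well above the far one, which is the signal a sparse reward cannot provide.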

Key Novelty

Self-Evolutionary Reinforcement Fine-Tuning (SE-RFT)

Introduces a 'Dense Point Reward' that gives continuous feedback based on pixel distance to the target, rather than just success/failure, guiding the model toward precise coordinates
Implements a self-supervision loop where the model's own attention maps are analyzed; if the model doesn't focus on the target region during training, that specific sample is dynamically filtered out to prevent noise
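
A minimal sketch of the attention-based filtering idea follows, using NumPy. Everything specific here is an assumption for illustration: the function name `keep_sample`, the patch-grid box format, the two threshold values, and the exact definitions of the peak and global scores are hypothetical, since the summary does not give the paper's formulas or thresholds.

```python
import numpy as np

def keep_sample(attn, box, peak_thresh=0.05, mass_thresh=0.2):
    """Return True if attention is concentrated enough on the target box.

    attn: 2D non-negative array over image patches, shape (H, W), sum > 0.
    box:  (r1, c1, r2, c2) in patch coordinates, end-exclusive.
    Thresholds are hypothetical placeholders, not values from the paper.
    """
    attn = attn / attn.sum()          # normalize to a distribution over patches
    r1, c1, r2, c2 = box
    region = attn[r1:r2, c1:c2]
    p_peak = region.max()             # strongest single activation inside the box
    p_global = region.sum()           # share of total attention inside the box
    return bool(p_peak >= peak_thresh or p_global >= mass_thresh)

target = (3, 3, 6, 6)

focused = np.full((10, 10), 0.01)
focused[4, 4] = 5.0                   # attention peaks inside the target box

elsewhere = np.full((10, 10), 0.01)
elsewhere[0, 0] = 5.0                 # attention peaks far from the target box

print(keep_sample(focused, target), keep_sample(elsewhere, target))
```

In a training loop, samples for which this check fails would contribute no loss for that step, which is one way to read "dynamically filtered out".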

Architecture

The overall SE-GUI framework including data curation, the RL training loop with dense rewards, and the attention-based filtering mechanism.

Evaluation Highlights

Achieves 47.3% accuracy on ScreenSpot-Pro, a challenging high-resolution benchmark
Outperforms the massive UI-TARS-72B model by a margin of 24.2% while using only a 7B parameter model
Attains state-of-the-art results using only 3,018 high-quality training samples, demonstrating extreme data efficiency compared to SFT

Breakthrough Assessment

8/10

Significant efficiency jump: beats a 72B model with a 7B model using only 3k samples. The attention-guided data filtering is a clever, methodologically sound way to handle noisy GUI data.

⚙️ Technical Details

Problem Definition

Setting: Visual Grounding in GUI environments (referring expression comprehension)

Inputs: A screenshot image and a natural language instruction

Outputs: Exact coordinate point (x, y) or bounding box of the target UI element

Pipeline Flow

Input (Screenshot + Instruction) -> VLM Encoder
Decoder -> Position Prediction
Self-Evolution Loop (Training only): Attention Map Extraction -> Filtering -> Loss Calculation

System Modules

Base VLM

Encodes visual and textual inputs and generates coordinate tokens

Model or implementation: Qwen2.5-VL-7B

Modeling

Base Model: Qwen2.5-VL-7B

Training Method: Group Relative Policy Optimization (GRPO) with Self-Evolutionary RFT

Objective Functions:

Purpose: Optimize policy to maximize dense rewards.

Formally: Maximize Advantage A_i based on normalized rewards (Format Reward + Point Reward) within a group of N outputs.

Purpose: Ensure training stability.

Formally: Minimize KL divergence between the current policy and the reference model policy.

Purpose: Filter noisy data using self-attention.

Formally: Zero out loss for samples where attention maps do not show significant activation (P_peak) or high global average (P_global) in the target bounding box region.
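
The first and third objectives can be sketched together. The advantage computation below follows the standard GRPO recipe of normalizing each reward by the group mean and standard deviation; the weights `alpha` and `beta`, the sample rewards, and the helper `masked_mean_loss` are hypothetical, and the KL penalty is omitted for brevity.

```python
import numpy as np

def group_advantages(format_rewards, point_rewards, alpha=1.0, beta=1.0, eps=1e-6):
    """Group-relative advantages A_i = (r_i - mean(r)) / (std(r) + eps).

    r_i = alpha * format_reward_i + beta * point_reward_i for N sampled outputs.
    alpha and beta are placeholder weights, not values from the paper.
    """
    r = alpha * np.asarray(format_rewards, dtype=float) \
        + beta * np.asarray(point_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def masked_mean_loss(per_sample_loss, keep_mask):
    """Average loss over kept samples; filtered samples contribute zero."""
    loss = np.asarray(per_sample_loss, dtype=float)
    mask = np.asarray(keep_mask, dtype=float)
    return float((loss * mask).sum() / max(mask.sum(), 1.0))

# Four sampled outputs for one instruction (hypothetical rewards).
adv = group_advantages(format_rewards=[1, 1, 1, 0],
                       point_rewards=[0.9, 0.5, 0.1, 0.0])
print(np.round(adv, 3))

# Second sample is dropped by the attention filter, so it adds no loss.
print(masked_mean_loss([2.0, 4.0, 6.0], keep_mask=[1, 0, 1]))
```

Because advantages are relative to the group, outputs closer to the target are pushed up and farther ones pushed down even when none of them would have earned a binary success.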

Training Data:

Initial pool: ~300k samples from ShowUI, UGround, AriaUI
Filtered down to 3,018 samples (SE-GUI-3k) using VLM-based quality scoring and difficulty checks

Key Hyperparameters:

training_sample_size: 3,018
reward_structure: Combination of Format Reward (alpha) and Dense Point Reward (beta)

Compute: Not reported in the paper

Comparison to Prior Work

vs. UI-TARS: SE-GUI uses RL with dense rewards instead of pure SFT, achieving higher accuracy with roughly 10x fewer parameters (7B vs 72B)
vs. UI-R1/GUI-R1: SE-GUI uses continuous distance-based rewards instead of sparse binary (0/1) rewards to solve the 'cold start' problem in RL
vs. Standard SFT: SE-GUI actively filters its own training data using attention maps during the training loop

Limitations

Relies on the assumption that attention maps correlate perfectly with grounding intent; 'correct' predictions with 'incorrect' attention might be filtered out
Performance depends heavily on the quality of the initial seed data curation
No specific computational cost or training time reported for the RL phase

Reproducibility

Code availability is not stated in the paper text. The method relies on specific filtering thresholds (tau) and reward weights (alpha, beta) that appear in the formulas, but their exact numerical values are not given in the provided text.

📊 Experiments & Results

Evaluation Setup

Visual grounding on diverse GUI platforms (Desktop, Mobile, Web)

Benchmarks:

ScreenSpot-Pro (Complex/Professional GUI Grounding)
ShowUI (GUI Grounding)
UGround (GUI Grounding)

Metrics:

Accuracy (assumed, since results are reported as percentages)
Statistical methodology: Not explicitly reported in the paper
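
For reference, grounding accuracy on point predictions is commonly computed as the share of predicted points that fall inside the ground-truth bounding box. The sketch below assumes that convention; the function name and sample data are hypothetical, and the summary does not confirm that the paper scores predictions exactly this way.

```python
def grounding_accuracy(predictions, boxes):
    """Percentage of predicted (x, y) points that land inside their target box.

    predictions: list of (x, y) points.
    boxes:       list of (x1, y1, x2, y2) ground-truth boxes, same length.
    """
    hits = sum(
        x1 <= x <= x2 and y1 <= y <= y2
        for (x, y), (x1, y1, x2, y2) in zip(predictions, boxes)
    )
    return 100.0 * hits / len(boxes)

preds = [(5, 5), (50, 50), (12, 8), (99, 99)]
targets = [(0, 0, 10, 10), (0, 0, 10, 10), (10, 0, 20, 10), (90, 90, 100, 100)]
print(grounding_accuracy(preds, targets))
```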

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ScreenSpot-Pro	Accuracy	23.1	47.3	+24.2

Main Takeaways

A 7B model trained with RL and dense rewards significantly outperforms larger SFT-trained models on complex GUI tasks.
Data quality matters more than quantity: 3k curated samples yielded SOTA results compared to models trained on much larger noisy datasets.
The self-evolutionary mechanism (filtering data based on attention) effectively removes noise where instructions do not align with visual elements.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically Policy Optimization)
Vision-Language Models (VLMs)
Attention Mechanisms in Transformers

Key Terms

Visual Grounding: The task of locating a specific object or element in an image based on a natural language description

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs against their mean, removing the need for a separate value network

SFT: Supervised Fine-Tuning—training a model on a fixed dataset of input-output pairs

DOM: Document Object Model—the underlying code structure of a web page, which often contains elements not actually visible to the user

VLM: Vision-Language Model—an AI model capable of understanding and generating content based on both image and text inputs

Dense Point Reward: A continuous reward signal calculated based on the normalized distance between a predicted point and the ground truth center, providing a smoother learning signal than binary rewards

IoU: Intersection over Union—a metric measuring the overlap between a predicted bounding box and the ground truth box

KL divergence: Kullback-Leibler divergence—a statistical distance measure used here to ensure the RL-tuned model doesn't drift too far from the original reference model
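
As a small worked illustration of the KL divergence term, the sketch below computes KL(p || q) for two discrete distributions. The distributions are toy values chosen for illustration; in RL fine-tuning the penalty is typically evaluated on the policy's and reference model's token probabilities, and the exact estimator used in the paper is not described in this summary.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Assumes q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.5, 0.5]      # toy reference-model distribution
unchanged = [0.5, 0.5]      # policy identical to the reference
drifted = [0.9, 0.1]        # policy that has moved away from the reference

print(kl_divergence(unchanged, reference))
print(kl_divergence(drifted, reference))
```

The divergence is zero when the policy matches the reference and grows as the policy drifts, which is why penalizing it keeps the fine-tuned model close to its starting point.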