Shuquan Lian, Yuhang Wu, Jianghong Ma, Zihan Song, Bin Chen, Xiawu Zheng, Hui Li
arXiv.org
(2025)
MM, Agent, RL, Reasoning, Benchmark
📝 Paper Summary
GUI Agents, Multimodal Reinforcement Learning
UI-AGILE enhances GUI agents by training with continuous distance-based rewards and length-constrained reasoning, while using a tile-based inference strategy to handle high-resolution visual noise.
Core Problem
GUI agents face three problems: a reasoning dilemma (elaborate thinking hurts grounding accuracy and latency, while no thinking hurts planning), binary rewards that fail to teach precise clicking, and visual noise on high-resolution screens.
Why it matters:
Elaborate reasoning processes often degrade grounding accuracy and increase latency, making agents slow and less precise
Simple binary rewards (success/fail) provide sparse feedback on complex tasks and do not incentivize clicking the semantic center of an element
Concrete Example: On a 3840x2160 screen, tokenizing the full image yields over 10,000 visual tokens, most of them irrelevant background. A standard agent processing this full image fails to locate a small button because of the noise. In preliminary tests, simply cropping the image to 1024x1024 improved UGround-V1-7B's accuracy from 31.6 to 56.0.
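The token count in the example above can be checked with rough arithmetic. The sketch below assumes each visual token covers a 28x28 pixel square, as in Qwen2-VL-style encoders (14-pixel patches merged 2x2). This summary does not name the tokenizer, so the patch size and the helper name `visual_token_count` are illustrative assumptions.

```python
import math


def visual_token_count(width: int, height: int, pixels_per_token_side: int = 28) -> int:
    """Approximate the number of visual tokens for an image.

    Assumption: one token per 28x28 pixel square (Qwen2-VL-style encoder);
    the actual encoder used in the paper may differ.
    """
    cols = math.ceil(width / pixels_per_token_side)
    rows = math.ceil(height / pixels_per_token_side)
    return cols * rows


full_screen_tokens = visual_token_count(3840, 2160)  # 138 * 78 = 10764
cropped_tokens = visual_token_count(1024, 1024)      # 37 * 37 = 1369
```

Under this assumption the full 4K screenshot costs 10,764 tokens while the 1024x1024 crop costs 1,369, roughly an eightfold reduction in the visual context the model has to search.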
Key Novelty
UI-AGILE (Training & Inference Framework)
**Simple Thinking (Training):** A reward function that encourages reasoning chains of moderate length—penalizing both 'under-thinking' and 'over-thinking'—to balance planning capability with grounding accuracy.
**Continuous Grounding Reward (Training):** Replaces binary success/fail rewards with a continuous score based on the Chebyshev distance to the target center, incentivizing precise localization.
**Decomposed Grounding (Inference):** Splits high-resolution screens into sub-images to reduce noise, generates candidates per sub-image, and uses a VLM to select the best match via a Yes/No Q&A process.
Architecture
Overview of the UI-AGILE framework covering both Training (left) and Inference (right) stages.
Evaluation Highlights
Achieves 23% grounding accuracy improvement over the best baseline on the ScreenSpot-Pro benchmark when using both training and inference enhancements
Preliminary controlled experiments show that reducing visual noise via cropping improves UGround-V1-7B accuracy from 31.6 to 56.0 (+24.4 points)
Efficient training requiring only ~9,000 samples and 2 epochs to achieve superior performance
Breakthrough Assessment
8/10
Addresses critical bottlenecks in GUI agents (visual noise on 4K screens and sparse rewards) with practical, effective solutions. The inference decomposition strategy is a plug-and-play enhancement applicable to existing models.
⚙️ Technical Details
Problem Definition
Setting: GUI Agent Grounding and Navigation
Inputs: User instruction and a high-resolution screenshot
Outputs: Precise coordinates (x, y) for the target UI element
Pipeline Flow
Decomposition (Split Screenshot)
Candidate Generation (Agent Prediction)
Element Image Extraction (Cropping)
Selection (VLM Adjudication)
System Modules
Image Decomposer
Divides high-resolution screenshots into multiple overlapping sub-images to reduce token count and visual noise per forward pass
Model or implementation: Deterministic image processing
Candidate Generator
Predicts target coordinates independently on each sub-image
Model or implementation: GUI Agent (MLLM trained with UI-AGILE method)
Selector
Scores candidate element images to identify the best match for the instruction
Model or implementation: VLM (Vision-Language Model)
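As a rough illustration of the Image Decomposer, the sketch below splits a screenshot into overlapping tiles and maps a tile-local prediction back to full-screen coordinates. The tile size, overlap, and function names are assumptions made for this example; the summary does not state the values the paper uses.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def _tile_starts(length: int, tile: int, stride: int) -> List[int]:
    """Start offsets along one axis; the last tile is flush with the edge."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts


def split_into_tiles(width: int, height: int, tile: int = 1024, overlap: int = 128) -> List[Box]:
    """Return overlapping tile boxes that together cover the whole screenshot.

    `tile` and `overlap` are hypothetical defaults, not values from the paper.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    stride = tile - overlap
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in _tile_starts(height, tile, stride)
        for x in _tile_starts(width, tile, stride)
    ]


def to_global(point: Tuple[int, int], tile_box: Box) -> Tuple[int, int]:
    """Map a point predicted inside a tile back to full-screenshot coordinates."""
    return (point[0] + tile_box[0], point[1] + tile_box[1])


tiles = split_into_tiles(3840, 2160)
```

The overlap exists so that an element cut by one tile boundary appears whole in a neighbouring tile. Each box could be passed to an image library's crop routine before running the Candidate Generator on the sub-image.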
Novel Architectural Elements
Decomposed grounding with selection: A multi-stage inference pipeline that replaces single-pass prediction with a divide-and-conquer approach (Decompose -> Predict -> Select) to handle high-resolution inputs
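The extraction and selection stages of this pipeline can be sketched as follows. The code assumes the selector VLM is wrapped in a caller-supplied function `yes_probability(instruction, element_image)` that returns the probability of a "Yes" answer. That callable, the crop size, and all names here are hypothetical; the summary only states that a VLM adjudicates candidates through Yes/No Q&A.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def element_crop_box(point: Point, width: int, height: int, size: int = 200) -> Box:
    """Box of side `size` around a predicted point, clamped to the image bounds.

    `size` is an illustrative value, not one reported in the paper.
    """
    half = size // 2
    left = max(0, min(point[0] - half, width - size))
    top = max(0, min(point[1] - half, height - size))
    return (left, top, min(width, left + size), min(height, top + size))


def select_candidate(
    instruction: str,
    candidates: Sequence[Tuple[Point, Any]],
    yes_probability: Callable[[str, Any], float],
) -> Optional[Point]:
    """Return the candidate point whose element image scores the highest 'Yes'."""
    best_point: Optional[Point] = None
    best_score = float("-inf")
    for point, element_image in candidates:
        score = yes_probability(instruction, element_image)
        if score > best_score:
            best_point, best_score = point, score
    return best_point
```

In use, `candidates` would hold one (global point, cropped element image) pair per sub-image, and `yes_probability` would query the selector VLM with a question such as whether the crop matches the instruction.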
Modeling
Base Model: Multimodal Large Language Model (the specific architecture is not named in this summary; the method is compatible with existing MLLMs)
Training Method: Reinforcement Fine-Tuning (RFT) using GRPO
Objective Functions:
Purpose: Encourage concise reasoning thoughts.
Formally: R_think = I(R_grounding > 0) * R_length(L) + R_bonus, where I(·) is the indicator function, L is the length of the reasoning text, R_length is a non-linear reward that peaks within an ideal length range, and R_bonus rewards syntactic completeness.
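A minimal sketch of this reward follows. The summary does not give the shape of R_length or the definition of R_bonus, so the cosine ramps, the word-count thresholds, the bonus value, and the punctuation check below are all placeholder assumptions chosen only to show the structure of the formula.

```python
import math


def length_reward(n_words: int, ideal_lo: int = 20, ideal_hi: int = 60,
                  hard_lo: int = 5, hard_hi: int = 120) -> float:
    """Hypothetical R_length: 1 inside the ideal range, cosine ramps outside it.

    All four thresholds are illustrative, not values from the paper.
    """
    if ideal_lo <= n_words <= ideal_hi:
        return 1.0
    if n_words <= hard_lo or n_words >= hard_hi:
        return 0.0
    if n_words < ideal_lo:  # under-thinking side
        frac = (n_words - hard_lo) / (ideal_lo - hard_lo)
    else:                   # over-thinking side
        frac = (hard_hi - n_words) / (hard_hi - ideal_hi)
    return 0.5 * (1.0 - math.cos(math.pi * frac))


def think_reward(grounding_reward: float, thought: str, bonus: float = 0.1) -> float:
    """R_think = I(R_grounding > 0) * R_length(L) + R_bonus.

    Assumption: 'syntactic completeness' is approximated by the thought
    ending with sentence-final punctuation.
    """
    gate = 1.0 if grounding_reward > 0 else 0.0
    r_bonus = bonus if thought.strip().endswith((".", "!", "?")) else 0.0
    return gate * length_reward(len(thought.split())) + r_bonus
```

The indicator gate means a well-sized thought earns its length reward only when the click itself scored above zero, so the model cannot collect reward for reasoning that leads to a wrong location.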
Purpose: Incentivize precise clicking at the element center.
Formally: R(x,y) = 1 - d_norm( (x,y), Center_bbox ), where d_norm is the normalized Chebyshev distance.
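The formula above can be sketched in a few lines. Two details are assumptions, since the summary does not specify them: each axis is normalized by the half-width or half-height of the bounding box (so the distance reaches 1 at the box edge), and clicks outside the box receive zero reward.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def grounding_reward(x: float, y: float, bbox: Box) -> float:
    """R(x, y) = 1 - d_norm((x, y), center), with d_norm a Chebyshev distance.

    Assumptions: normalization by the box half-extents, and zero reward
    for any click that falls outside the box.
    """
    left, top, right, bottom = bbox
    if right <= left or bottom <= top:
        raise ValueError("bbox must have positive width and height")
    if not (left <= x <= right and top <= y <= bottom):
        return 0.0
    center_x, center_y = (left + right) / 2, (top + bottom) / 2
    half_w, half_h = (right - left) / 2, (bottom - top) / 2
    d_norm = max(abs(x - center_x) / half_w, abs(y - center_y) / half_h)
    return 1.0 - d_norm


button = (100.0, 100.0, 200.0, 140.0)
center_score = grounding_reward(150, 120, button)    # 1.0 at the exact center
halfway_score = grounding_reward(175, 130, button)   # 0.5 halfway to the edge on both axes
```

Because the Chebyshev distance takes the larger of the two per-axis offsets, points of equal reward form rectangles concentric with the bounding box rather than circles.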
Training Data:
Cropping-based Resampling: Dynamically adjusts difficulty by cropping samples that yield zero reward (sparse reward mitigation)
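One way to picture cropping-based resampling is a scan for a smaller window that still contains the target element, which yields an easier version of a sample that earned zero reward. The sketch below is not the paper's Alg. 1; the scale factor, stride, scan order, and function names are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def _offsets(length: int, window: int, stride: int) -> List[int]:
    """Scan offsets along one axis, always including the edge-aligned one."""
    last = length - window
    offsets = list(range(0, last + 1, stride))
    if offsets[-1] != last:
        offsets.append(last)
    return offsets


def crop_containing_target(
    width: int, height: int, bbox: Box, scale: float = 0.5, stride: int = 64
) -> Optional[Tuple[Box, Box]]:
    """Scan crop windows and return the first one that fully contains `bbox`.

    Returns (window, bbox in window coordinates), or None if the target
    does not fit in a window of the reduced size. `scale` and `stride`
    are hypothetical parameters.
    """
    crop_w, crop_h = int(width * scale), int(height * scale)
    left, top, right, bottom = bbox
    if right - left > crop_w or bottom - top > crop_h:
        return None
    for y0 in _offsets(height, crop_h, stride):
        for x0 in _offsets(width, crop_w, stride):
            if x0 <= left and y0 <= top and right <= x0 + crop_w and bottom <= y0 + crop_h:
                window = (x0, y0, x0 + crop_w, y0 + crop_h)
                return window, (left - x0, top - y0, right - x0, bottom - y0)
    return None


result = crop_containing_target(3840, 2160, (3000, 1500, 3100, 1540))
```

The cropped sample keeps the instruction and the element but removes most of the surrounding screen, so the policy has a better chance of earning a non-zero reward on the retry.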
Key Hyperparameters:
training_samples: Approx. 9,000
epochs: 2
Compute: Not reported in the paper
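GRPO, named above as the training method, scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-normalization step in its commonly described form, advantage = (reward - group mean) / group standard deviation. The use of the population standard deviation and the `eps` constant are choices made here, and the paper's exact variant may differ.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    `eps` guards against division by zero when every reward is identical.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
all_failed_group = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When every response in a group earns zero reward, all advantages are zero and the sample contributes no gradient signal. This is the sparse-reward situation that continuous rewards and cropping-based resampling are meant to mitigate.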
Comparison to Prior Work
vs. UI-R1/GUI-R1: Uses continuous Chebyshev rewards instead of binary rewards to teach precision; uses length-regulated 'Simple Thinking' instead of unconstrained reasoning.
vs. UGround: Introduces inference-time decomposition to handle high-res screens where UGround struggles with visual noise.
vs. Standard SFT: Uses cropping-based resampling curriculum to learn from hard samples that initially yield zero reward.
Code is provided at https://github.com/KDEGroup/UI-AGILE. The paper details the reward function formulas and the cropping algorithms (Alg. 1). Specific model weights or base checkpoints are not explicitly named in the text snippet.
📊 Experiments & Results
Evaluation Setup
GUI Navigation and Grounding on screenshots
Benchmarks:
ScreenSpot-Pro (GUI Element Grounding)
ScreenSpot-v2 (GUI Element Grounding)
Metrics:
Grounding Accuracy
Statistical methodology: Not explicitly reported in the paper
Key Results
Preliminary controlled experiments validate the hypothesis that visual noise reduction (via cropping) significantly improves grounding performance.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| ScreenSpot-Pro (Modified) | Grounding Accuracy | 31.6 | 56.0 | +24.4 |
Experiment Figures
Illustration of the scanning approach for cropping-based resampling.
Main Takeaways
UI-AGILE achieves state-of-the-art grounding performance on ScreenSpot-Pro and ScreenSpot-v2.
The combination of training enhancements (Simple Thinking, Continuous Reward) and inference enhancements (Decomposed Grounding) yields a 23% improvement over the best baseline on ScreenSpot-Pro.
Visual noise on high-resolution screens is a major bottleneck; cropping/decomposition strategies significantly alleviate this, turning 'impossible' high-res tasks into solvable low-res ones.
Continuous rewards based on Chebyshev distance are more effective than binary rewards for training agents to click the semantic center of UI elements.
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used here to fine-tune the agent
RFT: Reinforcement Fine-Tuning—applying RL after supervised fine-tuning to further align model behavior
Chebyshev distance: A distance metric (L-infinity norm) where the distance between two points is the greatest of their differences along any coordinate dimension; produces square reward contours matching GUI bounding boxes
SFT: Supervised Fine-Tuning—training the model on labeled instruction-action pairs
VLM: Vision-Language Model—a model capable of processing both image and text inputs
Visual Noise: Irrelevant visual information (pixels/tokens) in high-resolution screenshots that distracts the model from the target element
IoU: Intersection over Union—a standard metric for measuring the overlap between predicted and ground-truth bounding boxes