The Hong Kong University of Science and Technology,
Alibaba Group,
The Chinese University of Hong Kong,
Nanjing University of Science and Technology
arXiv.org
(2025)
MM, Agent, RL, Benchmark
📝 Paper Summary
GUI Agents, Visual Grounding, Preference Optimization
Location Preference Optimization (LPO) improves the spatial precision of GUI agents by optimizing a multimodal policy with Group Relative Policy Optimization (GRPO), driven by novel pixel-entropy and physical-distance rewards.
Core Problem
Supervised Fine-Tuning (SFT) limits GUI agents' spatial perception, while existing RL methods rely on coarse, static thresholds (e.g., bounding boxes) that fail to distinguish varying degrees of positional accuracy.
Why it matters:
Precise interaction (e.g., clicking exact coordinates) is fundamental for autonomous agents to function correctly in complex user interfaces.
Current methods like UI-TARS rely on labor-intensive manual data construction for preference optimization.
Static decision boundaries in prior RL work offer only coarse evaluations, leading to imprecise localization where 'close' is treated the same as 'perfect'.
Concrete Example: In current approaches, a click 5 pixels from a button's center and a click 20 pixels from it are both scored as 'success' if they fall inside a fixed bounding-box threshold, and both as 'fail' if they fall outside it. LPO instead uses a continuous distance reward, so the agent earns a higher reward by shrinking the 20-pixel error toward 0 pixels.
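The contrast in the concrete example can be made tangible with a few lines of Python. This is an illustrative sketch only, not the paper's code: the function names, the bounding box, and the clamp at zero are assumptions, and the distance reward uses the form R_location = 1 - (Distance / d_max) with d_max = 1000 as listed under Technical Details.

```python
import math

def binary_threshold_reward(click, bbox):
    """Hypothetical fixed-threshold reward: 1.0 inside the box, 0.0 outside."""
    x, y = click
    x_min, y_min, x_max, y_max = bbox
    inside = x_min <= x <= x_max and y_min <= y <= y_max
    return 1.0 if inside else 0.0

def distance_reward(click, target, d_max=1000.0):
    """Continuous reward 1 - distance / d_max.

    Clamping at 0.0 for distances beyond d_max is an assumption of this sketch.
    """
    distance = math.dist(click, target)  # Euclidean distance
    return max(0.0, 1.0 - distance / d_max)

# Illustrative numbers: a button centered at (500, 300) with a 60 x 40 box.
target = (500.0, 300.0)
bbox = (470.0, 280.0, 530.0, 320.0)
near = (505.0, 300.0)  # 5 pixels from the center
far = (520.0, 300.0)   # 20 pixels from the center

for name, click in (("near", near), ("far", far)):
    print(name,
          binary_threshold_reward(click, bbox),
          distance_reward(click, target))
```

Both clicks receive the same binary reward, while the distance-based reward ranks the 5-pixel click above the 20-pixel click.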
Key Novelty
Location Preference Optimization (LPO)
Uses 'Window-based Information Density' to guide agents toward information-rich areas (buttons/text) by calculating pixel entropy within grid segments.
Implements a 'Dynamic Location Reward' based on Euclidean distance to target coordinates, providing granular feedback on spatial precision rather than binary success/fail.
Integrates these rewards into a Group Relative Policy Optimization (GRPO) framework to explore the GUI environment without needing manually paired preference data.
Architecture
Comparison of different preference optimization strategies (RL with fixed rewards vs. LPO with dynamic rewards) and the proposed reward mechanism.
Breakthrough Assessment
7/10
Introduces a logical, physically grounded reward mechanism for GUI agents that moves beyond binary hit/miss metrics. While the methodology is sound, the text provided does not contain the experimental results to verify the magnitude of improvement.
⚙️ Technical Details
Problem Definition
Setting: Markov Decision Process (MDP) where an agent interacts with a GUI to maximize cumulative reward.
Inputs: RGB screenshot s_t and instruction I
Outputs: Action a_t consisting of action type A_t (click, drag, etc.) and coordinates E_t
Pipeline Flow
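Agent observes state s_t (screenshot) -> generates a group of candidate actions {a_g}
Window-based Information Density reward
Purpose: Reward interactions in high-entropy (information-rich) windows.
Formally: R_entropy = H(W_{i*,j*}) / max(H), where H is the pixel entropy of a grid window, W_{i*,j*} is the window the action lands in, and max(H) is the largest window entropy on the screen.
Dynamic Location Reward
Purpose: Reward spatial precision via Euclidean distance.
Formally: R_location = 1 - (Distance / d_max), where Distance is the Euclidean distance between the predicted and target coordinates.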
Key Hyperparameters:
epsilon (entropy stability): 1e-6
d_max (distance scaling): 1000
M x N (grid resolution): Matches MLLM settings (exact numbers not reported)
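The entropy reward and its stability constant can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how pixel entropy is computed or where epsilon enters, so the sketch assumes Shannon entropy over an 8-bit grayscale histogram and adds epsilon to the denominator. The function names and the 2 x 2 grid in the usage are hypothetical.

```python
import numpy as np

EPSILON = 1e-6  # entropy stability constant from the hyperparameter list

def window_entropies(gray, m, n):
    """Shannon entropy (bits) of the intensity histogram in each of m x n windows.

    `gray` is a 2-D uint8 array (an assumed grayscale view of the screenshot).
    """
    h, w = gray.shape
    entropies = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            window = gray[i * h // m:(i + 1) * h // m,
                          j * w // n:(j + 1) * w // n]
            counts = np.bincount(window.ravel(), minlength=256)
            p = counts[counts > 0] / counts.sum()  # skip empty bins
            entropies[i, j] = -np.sum(p * np.log2(p))
    return entropies

def entropy_reward(gray, click, m, n, eps=EPSILON):
    """R_entropy = H(W_{i*,j*}) / max(H) for the window containing `click`.

    `click` is (x, y) in pixels; eps in the denominator is an assumption that
    keeps the ratio defined when every window is flat.
    """
    h, w = gray.shape
    x, y = click
    i = min(int(y * m / h), m - 1)  # row index of the clicked window
    j = min(int(x * n / w), n - 1)  # column index of the clicked window
    entropies = window_entropies(gray, m, n)
    return float(entropies[i, j] / (entropies.max() + eps))

# Usage: a blank screen with one noisy (information-rich) quadrant.
rng = np.random.default_rng(0)
screen = np.zeros((100, 100), dtype=np.uint8)
screen[:50, :50] = rng.integers(0, 256, size=(50, 50), dtype=np.uint8)
print(entropy_reward(screen, (10, 10), 2, 2))  # click in the noisy quadrant
print(entropy_reward(screen, (90, 90), 2, 2))  # click in a flat quadrant
```

A click in the noisy quadrant scores close to 1, while a click in a flat quadrant scores 0. The same flat-window behavior is what makes the method sensitive to minimalist interface designs.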
Comparison to Prior Work
vs. UI-TARS: LPO does not require manual construction of preference pairs; it uses GRPO with automatic rewards.
vs. Threshold-based RL: LPO uses a continuous 'Dynamic Location Reward' based on distance, offering finer granularity than binary thresholding.
vs. DPO methods: LPO leverages group relative sampling (GRPO) to explore the environment more broadly than paired DPO.
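The group-relative idea behind GRPO, as opposed to DPO's fixed preferred/dispreferred pairs, can be illustrated by the advantage computation: each sampled action's reward is standardized against the other actions sampled for the same screenshot. The sketch below shows only that standard normalization step. How LPO combines its entropy and location rewards into a single scalar is not stated in this summary, so the reward values are made-up inputs and the function name is hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group: (r - mean) / (std + eps).

    No critic model is involved; the group itself provides the baseline.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four candidate clicks sampled for the same instruction (illustrative rewards).
group_rewards = [0.995, 0.98, 0.60, 0.20]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Actions with above-average reward in their group get positive advantages and are reinforced; below-average actions get negative advantages. No hand-built preference pairs are required, because the ranking emerges from the rewards of the sampled group.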
Limitations
Relies on the assumption that interaction targets (buttons) always have higher information entropy than background, which may not hold for flat/minimalist designs.
Requires ground truth target coordinates for the Dynamic Location Reward calculation during training.
The window-based approach divides the screen into a fixed grid, potentially splitting a single UI element across two windows.
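The grid-splitting limitation is easy to reproduce with a small helper that lists which windows a UI element's bounding box touches. This is a hypothetical illustration: the function name, the 1000 x 1000 screen, and the 4 x 4 grid are assumptions, since the actual grid dimensions are not reported.

```python
def windows_covering(bbox, screen_w, screen_h, m, n):
    """Return the (row, col) grid windows touched by bbox = (x_min, y_min, x_max, y_max).

    The screen is split into m rows and n columns of equal size.
    """
    x_min, y_min, x_max, y_max = bbox
    col_w = screen_w / n
    row_h = screen_h / m
    j0 = max(int(x_min // col_w), 0)
    j1 = min(int(x_max // col_w), n - 1)
    i0 = max(int(y_min // row_h), 0)
    i1 = min(int(y_max // row_h), m - 1)
    return [(i, j) for i in range(i0, i1 + 1) for j in range(j0, j1 + 1)]

# A 20-pixel-wide button straddling the boundary at x = 250 on a 4 x 4 grid.
print(windows_covering((240, 100, 260, 120), 1000, 1000, 4, 4))
# A button that sits entirely inside the first window.
print(windows_covering((10, 10, 50, 50), 1000, 1000, 4, 4))
```

When an element spans two windows, each window sees only part of its pixels, so the per-window entropy may understate how information-rich the element is.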
Code is promised to be publicly available at https://github.com/AIDC-AI/LPO. The paper describes the mathematical formulation of rewards clearly, but exact grid dimensions (M, N) and base model architecture are not specified in the provided text.
📊 Experiments & Results
Evaluation Setup
GUI interaction and visual grounding tasks across offline benchmarks and online environments.
Benchmarks:
Multimodal Mind2Web (Offline GUI Interaction)
VisualWebBench (Grounding)
Screenspot V2 (Grounding)
WebVoyager (Online Real-world GUI Interaction)
Metrics:
Interaction Precision (implied)
Success Rate (implied)
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
The paper claims LPO achieves State-of-the-Art (SOTA) performance on both offline benchmarks (Mind2Web, Screenspot V2) and online evaluations (WebVoyager).
The method demonstrates that combining information entropy (for coarse region finding) and physical distance (for fine precision) effectively optimizes GUI agents.
LPO purportedly outperforms existing RL strategies that rely on static decision boundaries or manual preference pair construction.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning (MDP, Rewards)
Graphical User Interface (GUI) Agents
Information Entropy
Key Terms
GRPO: Group Relative Policy Optimization, an RL algorithm that updates the policy using advantages computed by comparing each sampled action's reward against the others in its group, rather than using a separate critic model.
SFT: Supervised Fine-Tuning—training a model on labeled examples of correct behavior.
Information Entropy: A measure of the uncertainty or information density in a signal; used here to identify screen regions with high visual complexity (likely containing UI elements).
GUI Grounding: The ability of an AI agent to map textual descriptions or intentions to specific spatial coordinates (x, y) on a user interface.
DPO: Direct Preference Optimization—a method to align models using pairs of preferred and dispreferred outputs.