InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

📝 Paper Summary

GUI Grounding (Graphical User Interface), Vision-Language RL

InfiGUI-G1 improves GUI agent accuracy by forcing the model to generate multiple diverse coordinate candidates during training and rewarding them with a novel efficiency-based function that penalizes lazy, linear scanning.

Core Problem

Standard Reinforcement Learning with Verifiable Rewards (RLVR) suffers from a 'confidence trap,' where models repeatedly sample high-confidence but incorrect actions, failing to explore semantically correct alternatives.

Why it matters:

Inefficient exploration bottlenecks 'semantic alignment,' preventing agents from associating abstract icons with their correct functions
Current methods rely on either Supervised Fine-Tuning (data-intensive, with poor generalization) or standard RL (prone to getting stuck in local optima), which limits reliability in complex, real-world GUIs

Concrete Example: Given the instruction 'Use the camera to search,' a model might confidently click a generic 'Camera' icon. Standard RL keeps reinforcing this incorrect high-confidence action, never discovering the correct 'Google Lens' icon nearby because it rarely samples the tail of the probability distribution.

Key Novelty

Adaptive Exploration Policy Optimization (AEPO)

Forces the model to output multiple coordinate guesses in a single pass (Multi-Answer Generation) to uncover correct actions hidden in the tail of the probability distribution
Guides learning with an Adaptive Exploration Reward (AER) derived from efficiency principles (Utility/Cost), rewarding the model more for finding the correct answer early in its list of guesses
Applies a 'quality-of-exploration' penalty if the generated points form a straight line (collinear), ensuring the agent explores the 2D space diversely rather than just scanning linearly
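
The collinearity penalty described above needs a test for whether the proposed points lie on one straight line. A minimal sketch of such a test follows, using the cross product against a reference direction. The function name `are_collinear`, the tolerance `tol`, and the choice not to flag sets with fewer than three points are assumptions made for illustration, not details taken from the paper.

```python
def are_collinear(points, tol=1e-9):
    """Return True if all (x, y) points lie on a single straight line.

    Assumption: sets with fewer than three points are never flagged,
    since any two points trivially share a line.
    """
    if len(points) < 3:
        return False
    x0, y0 = points[0]
    # Reference direction: from the first point to the first distinct point.
    ref = next(
        ((x - x0, y - y0) for x, y in points[1:] if (x, y) != (x0, y0)),
        None,
    )
    if ref is None:
        return True  # every point is identical (degenerate case)
    dx, dy = ref
    # A zero cross product means the point sits on the reference line.
    return all(abs(dx * (y - y0) - dy * (x - x0)) <= tol for x, y in points[1:])


# A horizontal scan is flagged; a spread-out set is not.
scan = [(10, 5), (20, 5), (30, 5), (40, 5)]
spread = [(10, 5), (200, 80), (35, 300)]
print(are_collinear(scan), are_collinear(spread))
```

With pixel coordinates a larger tolerance may be more appropriate; the value here is only a placeholder.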

Architecture

The Adaptive Exploration Policy Optimization (AEPO) framework workflow during training.

Evaluation Highlights

Achieves up to 9.0% relative improvement against the Naive RLVR baseline on benchmarks testing generalization and semantic understanding
Establishes new state-of-the-art results among open-source models (3B and 7B) on MMBench-GUI, ScreenSpot-Pro, and UI-Vision
Demonstrates higher exploration success in a single pass (with ~2 answers) than a Naive RLVR baseline that is allowed 4 independent attempts (pass@4)

Breakthrough Assessment

8/10

Addresses a fundamental RL exploration problem in VLM agents with a theoretically grounded reward function. Significant gains on diverse benchmarks with modest model sizes (3B/7B).

⚙️ Technical Details

Problem Definition

Setting: Policy optimization for GUI Grounding

Inputs: Context c = (GUI Screenshot S, Natural Language Instruction I)

Outputs: Action a = Coordinate point p(x, y)

Pipeline Flow

Vision Encoder & LLM (Process screenshot + instruction)
Multi-Answer Generator (Output N coordinate points)
Reward Calculation (Check correctness, rank, and collinearity)
Policy Update (RLOO)

System Modules

Backbone MLLM

Process visual and textual context to generate action sequences

Model or implementation: Qwen2.5-VL-3B-Instruct / Qwen2.5-VL-7B-Instruct

Exploration Evaluator (Training / Reward)

Compute the Adaptive Exploration Reward (AER) based on efficiency

Model or implementation: Deterministic Function

Geometry Checker (Training / Reward)

Detect and penalize low-quality linear scanning strategies

Model or implementation: Deterministic Function

Novel Architectural Elements

Multi-answer generation head: Policy is modified to output a set of N points in a single forward pass rather than a single point
Integration of geometric collinearity checks directly into the RL reward signal

Modeling

Base Model: Qwen2.5-VL-3B-Instruct and Qwen2.5-VL-7B-Instruct

Training Method: RLOO (REINFORCE Leave-One-Out)
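
As a rough illustration of the leave-one-out idea behind RLOO, the sketch below computes each rollout's advantage as its reward minus the mean reward of the other rollouts for the same prompt. The function name `rloo_advantages` is hypothetical, and this covers only the baseline computation, not the full policy-gradient update.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for the rollouts of one prompt.

    Each rollout's baseline is the mean reward of the *other* rollouts,
    so a rollout never contributes to its own baseline.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two rollouts per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Four rollouts for one prompt; the advantages always sum to zero.
advantages = rloo_advantages([1.0, 0.0, 0.0, -1.0])
print(advantages)
```

Because the baseline excludes the rollout itself, it stays unbiased while still reducing the variance of the gradient estimate.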

Objective Functions:

Purpose: Maximize expected reward.

Formally: Gradient ascent on J(θ) = E[R(a, B)] using RLOO estimator

Purpose: Balance exploration utility and cost.

Formally: R_accuracy = sign(U) * sqrt(U^2 / (N * C_v)), where U is the exploration utility, N is the number of proposed points, and C_v is the verification cost (rank k if success, N if failure)

Purpose: Penalize collinear outputs.

Formally: If points are collinear, R_accuracy = -1

Purpose: Ensure output format validity.

Formally: R_total = R_format + R_accuracy * R_format
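
The reward terms listed above can be combined in a short sketch. Two things here are assumptions rather than statements from the summary: the utility U is taken to be +1 when some proposed point is correct and -1 otherwise, and R_format is treated as a binary 0/1 flag. The function names `accuracy_reward` and `total_reward` are hypothetical.

```python
import math


def accuracy_reward(n_points, first_hit_rank, collinear=False):
    """R_accuracy = sign(U) * sqrt(U^2 / (N * C_v)).

    n_points: N, the number of proposed points.
    first_hit_rank: 1-based rank k of the first correct point, or None.
    Assumption: U = +1 on success and U = -1 on failure.
    """
    if n_points < 1:
        raise ValueError("need at least one proposed point")
    if collinear:
        return -1.0  # collinear outputs override the efficiency term
    if first_hit_rank is None:
        utility, cost_v = -1.0, n_points  # failure: C_v = N
    else:
        utility, cost_v = 1.0, first_hit_rank  # success: C_v = k
    magnitude = math.sqrt(utility ** 2 / (n_points * cost_v))
    return math.copysign(magnitude, utility)


def total_reward(format_ok, r_accuracy):
    """R_total = R_format + R_accuracy * R_format (binary R_format assumed)."""
    r_format = 1.0 if format_ok else 0.0
    return r_format + r_accuracy * r_format


print(accuracy_reward(1, 1))     # one point, correct
print(accuracy_reward(4, 1))     # four points, first one correct
print(accuracy_reward(4, 4))     # four points, only the last one correct
print(accuracy_reward(4, None))  # four points, none correct
```

Under these assumptions, a single correct guess earns the largest reward, a correct guess ranked early beats the same guess ranked late, and proposing more points shrinks the magnitude of both the reward and the failure penalty.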

Training Data:

Mixture of Widget Caption, OmniAct, GUICourse (approx 44k samples)
Filtered: Samples where naive model gets 8/8 correct are removed (too easy)
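
The difficulty filter above can be sketched as follows, assuming each training sample already has a list of per-rollout correctness flags from the naive model (8 rollouts per sample, per the hyperparameters). The function name `filter_too_easy` and the data layout are hypothetical, as is the choice to keep samples that have no rollouts.

```python
def filter_too_easy(samples, rollout_correct):
    """Drop samples the naive model solves on every rollout.

    samples: list of training samples (any type).
    rollout_correct: parallel list; each entry is a list of booleans,
        one per rollout of the naive model on that sample.
    """
    kept = []
    for sample, flags in zip(samples, rollout_correct):
        solved_every_time = len(flags) > 0 and all(flags)
        if not solved_every_time:
            kept.append(sample)
    return kept


samples = ["s0", "s1", "s2"]
flags = [
    [True] * 8,           # 8/8 correct: too easy, removed
    [True] * 7 + [False], # 7/8 correct: kept
    [False] * 8,          # 0/8 correct: kept
]
print(filter_too_easy(samples, flags))
```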

Key Hyperparameters:

learning_rate: 1e-6
rollout_batch_size: 128
rloo_rollout_number: n=8
epochs: 3
sampling_temperature: 1.0 (for filtering)

Compute: 16 H800 GPUs

Comparison to Prior Work

vs. Naive RLVR: AEPO uses multi-answer generation + adaptive rewards vs. single-answer + binary rewards
vs. GUI-R1: AEPO avoids distance/IoU-based rewards in favor of efficiency-based discrete rewards
vs. SFT (SeeClick/UGround): AEPO is far more data-efficient (44k samples vs >1M for some SFT baselines)

Limitations

Computational cost of multi-answer generation during inference (though analysis shows it is more efficient than multi-pass naive sampling)
Reliance on accurate Ground Truth bounding boxes for reward verification
Performance depends on the quality of the base VLM (Qwen2.5-VL)

Reproducibility

Code: https://github.com/InfiXAI/InfiGUI-G1

Code and resources available at https://github.com/InfiXAI/InfiGUI-G1. Uses open-source Qwen2.5-VL backbones. Training data sources are public (Widget Caption, OmniAct, etc.). Exact filtering implementation details provided.

📊 Experiments & Results

Evaluation Setup

GUI Grounding: Predict coordinates for a target element given a screenshot and instruction.

Benchmarks:

MMBench-GUI (Hierarchical instructions: basic & advanced)
ScreenSpot-Pro (High-res screens, distinct text vs. icon grounding)
UI-Vision (Generalization to unseen desktop applications)
ScreenSpot-v2 (Mobile/Desktop/Web coverage)
UI-I2E-Bench (Implicit instructions requiring semantic reasoning)

Metrics:

Accuracy (Point within Bounding Box)
Exploration Success Rate
Statistical methodology: Not explicitly reported in the paper
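
The point-in-box accuracy metric can be illustrated with a short sketch. It assumes boxes are given as (x1, y1, x2, y2) with inclusive edges, which is a convention chosen here and not one stated in the summary. The `exploration_success` helper reflects one plausible reading of the exploration metric (any proposed point lands in the box) and should be treated as an assumption as well.

```python
def point_in_box(point, box):
    """True if (x, y) falls inside box = (x1, y1, x2, y2), edges inclusive."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(predictions, boxes):
    """Percentage of single-point predictions that land in their target box."""
    if not predictions:
        raise ValueError("no predictions to score")
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, boxes))
    return 100.0 * hits / len(predictions)


def exploration_success(candidates, box):
    """Assumed definition: at least one candidate point is inside the box."""
    return any(point_in_box(p, box) for p in candidates)


target = (0, 0, 10, 10)
print(grounding_accuracy([(5, 5), (50, 50)], [target, target]))
print(exploration_success([(50, 50), (3, 9)], target))
```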

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
UI-Vision	Accuracy	24.3	26.1	+1.8
ScreenSpot-v2	Accuracy	Not reported in the paper	93.5	Not reported in the paper
Multiple Benchmarks	Relative Improvement	Not reported in the paper	Not reported in the paper	+9.0%

Main Takeaways

AEPO significantly outperforms Naive RLVR, especially on semantically demanding tasks (ScreenSpot-Pro Icon grounding) where exploration is crucial.
The model learns an adaptive strategy: it generates more candidate answers for difficult benchmarks (2.1 for UI-Vision) and fewer for easy ones (1.4 for ScreenSpot-v2).
Multi-answer generation is efficient: InfiGUI-G1 finds correct answers in a single pass (avg ~2 answers) more often than Naive RLVR does in 4 independent attempts.
Ablation studies confirm all components are necessary: removing Collinear Penalty drops accuracy (model hacks reward with dense points), and removing AER's rank-awareness (k) reduces the model's confidence in the correct answer.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
Multimodal Large Language Models (MLLMs)
GUI Grounding concepts (bounding boxes, coordinates)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—using objective outcomes (e.g., correct coordinates) to train models via RL

AEPO: Adaptive Exploration Policy Optimization—the proposed framework combining multi-answer generation, adaptive rewards, and collinear penalties

AER: Adaptive Exploration Reward—a reward function based on efficiency (Utility divided by Cost), incentivizing the model to rank correct answers higher

RLOO: REINFORCE Leave-One-Out—a policy gradient algorithm that reduces variance by using the average reward of other samples in the batch as a baseline

SFT: Supervised Fine-Tuning—training models on labeled examples (input-output pairs) before applying reinforcement learning

Collinearity: A geometric property where points lie on the same straight line; penalized here to force spatial diversity

IoU: Intersection over Union—a metric measuring overlap between predicted and ground-truth bounding boxes

MCTS: Monte Carlo Tree Search—a search algorithm used for decision-making processes