GUI Grounding: The task of mapping a natural language description (e.g., 'click the search bar') to specific screen coordinates.
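IoU: Intersection-over-Union, a metric measuring the overlap between a predicted bounding box and the ground-truth box, computed as the area of their intersection divided by the area of their union (1.0 = perfect match, 0.0 = no overlap).
GIoU: Generalized IoU, a variant of IoU that provides a meaningful score even for non-overlapping boxes by subtracting a penalty based on the smallest box enclosing both: the fraction of that enclosing area not covered by the union. Scores range from just above -1 (far apart) to 1 (perfect match).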
Region Proposal: A candidate area of an image likely to contain an object, used in object detection to focus computational resources on relevant sub-regions.
SeeClick: The baseline Vision-Language Model specifically pre-trained for GUI grounding tasks (based on Qwen-VL).
RoPE: Rotary Positional Embedding—a method for encoding token positions in Transformers, modified here to handle multiple box hypotheses efficiently.
AgentStudio: A benchmark dataset for evaluating autonomous GUI agents.
ScreenSpot: A benchmark dataset for evaluating GUI element grounding.
AITW: Android in the Wild—a dataset for mobile GUI navigation.
Mind2Web: A dataset for web-based GUI navigation.
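
To make the IoU and GIoU entries above concrete, here is a minimal sketch of both metrics. It assumes axis-aligned boxes given as (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2; the function names and box format are illustrative choices, not taken from the source.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def _area(box: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def _intersection_and_union(a: Box, b: Box) -> Tuple[float, float]:
    """Return (intersection area, union area) of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = _area(a) + _area(b) - inter
    return inter, union


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union: 1.0 for identical boxes, 0.0 for disjoint ones."""
    inter, union = _intersection_and_union(a, b)
    return inter / union if union > 0 else 0.0


def giou(a: Box, b: Box) -> float:
    """Generalized IoU: IoU minus the share of the enclosing box not covered by the union."""
    inter, union = _intersection_and_union(a, b)
    iou_value = inter / union if union > 0 else 0.0
    # Smallest axis-aligned box that encloses both inputs.
    enclosing = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    enclosing_area = _area(enclosing)
    if enclosing_area == 0:
        return iou_value
    return iou_value - (enclosing_area - union) / enclosing_area


if __name__ == "__main__":
    target = (0.0, 0.0, 1.0, 1.0)
    near = (2.0, 0.0, 3.0, 1.0)   # disjoint from target, but close
    far = (9.0, 0.0, 10.0, 1.0)   # disjoint from target, and distant
    print(iou(target, near), iou(target, far))    # both are 0.0
    print(giou(target, near), giou(target, far))  # the nearer box scores higher
```

For two disjoint predictions, IoU is 0 in both cases, while GIoU still ranks the closer box higher. That ordering is what makes GIoU usable as a training or ranking signal when a predicted box misses the target entirely.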