GUI: Graphical User Interface, the visual, interactive layer (windows, icons, menus, buttons) through which users operate apps and operating systems
Visual Grounding: The process of mapping a natural language description (e.g., 'click the file menu') to a specific location on an image or screen
ViT: Vision Transformer—a neural network architecture that processes images by splitting them into fixed-size patches
NTP: Next-Token Prediction, the standard training objective for language models, in which the model predicts the next token (a word or subword unit) in a sequence
UI-TARS: A state-of-the-art GUI agent model used as a baseline for comparison
ScreenSpot-Pro: A benchmark dataset for evaluating GUI grounding capabilities
OS-Atlas: A large-scale GUI dataset used for training the grounding verifier
ROI pooling: Region of Interest pooling, a technique that extracts a fixed-size feature representation from a specific rectangular region of an image or feature map
OCR: Optical Character Recognition—converting text in images into machine-readable text
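
To make the ROI pooling entry above concrete, here is a minimal pure-Python sketch of the max-pooling variant. It is an illustration only, not the implementation used by any model listed here; the function name `roi_max_pool`, the single-channel 2D-list feature map, and the (x0, y0, x1, y1) box convention are all assumptions chosen for this example.

```python
from typing import List, Tuple


def roi_max_pool(
    feature_map: List[List[float]],
    roi: Tuple[int, int, int, int],
    out_h: int,
    out_w: int,
) -> List[List[float]]:
    """Pool a rectangular region of a 2D feature map to a fixed out_h x out_w grid.

    roi is (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive
    (an assumed convention for this sketch).
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    if roi_h < out_h or roi_w < out_w:
        raise ValueError("ROI must be at least as large as the output grid")

    pooled = []
    for i in range(out_h):
        # Integer bin boundaries along the vertical axis.
        ys = y0 + (i * roi_h) // out_h
        ye = y0 + ((i + 1) * roi_h) // out_h
        row = []
        for j in range(out_w):
            # Integer bin boundaries along the horizontal axis.
            xs = x0 + (j * roi_w) // out_w
            xe = x0 + ((j + 1) * roi_w) // out_w
            # Keep the strongest activation inside this bin.
            row.append(
                max(feature_map[y][x] for y in range(ys, ye) for x in range(xs, xe))
            )
        pooled.append(row)
    return pooled


# A 4x4 feature map whose cell at row r, column c holds r * 4 + c.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))  # [[5, 7], [13, 15]]
```

Whatever the size of the input region, the output grid has the same shape, which is what lets a downstream layer consume features from arbitrarily sized boxes.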