Set-of-Mark: A prompting technique where visual markers (e.g., numbered boxes) are overlaid on the objects in an image so that a model can refer to specific regions by ID
GUI: Graphical User Interface—the visual layer of an application, made up of elements such as icons, text, and buttons
LMM: Large Multimodal Model—an AI model capable of processing and reasoning over both text and images (e.g., GPT-4V)
Zero-shot: The ability of a model to perform a task without having been trained on any examples of that task
OCR: Optical Character Recognition—technology that converts text within images into machine-readable text data
AITW: Android in the Wild—a large-scale dataset of human demonstrations for controlling Android devices
HTML syntax: A text-based representation of screen elements used by baseline models to understand UI layout without seeing the image
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique for large language models
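
To make the Set-of-Mark entry concrete, here is a minimal sketch of the ID bookkeeping behind it: each detected element receives a numeric mark, and a mark chosen by the model is resolved back to a tap point on screen. The `Element` class, the helper names, and the sample boxes are all hypothetical and are not taken from any particular system. Drawing the markers on the screenshot is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Element:
    """A detected UI element (hypothetical illustration data)."""
    label: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def assign_marks(elements: List[Element]) -> Dict[int, Element]:
    """Give each element a 1-based numeric mark, in list order."""
    return {i: el for i, el in enumerate(elements, start=1)}


def mark_center(marks: Dict[int, Element], mark_id: int) -> Tuple[int, int]:
    """Resolve a mark ID to the pixel at the center of its bounding box."""
    left, top, right, bottom = marks[mark_id].box
    return ((left + right) // 2, (top + bottom) // 2)


# Assumed detector output for a made-up screen.
elements = [
    Element("Search", (40, 100, 680, 180)),
    Element("Settings", (40, 220, 200, 300)),
]
marks = assign_marks(elements)

# If the model answers "tap element 2", the agent taps this point.
tap = mark_center(marks, 2)
print(tap)  # (120, 260)
```

Because the model only has to name an ID, it never needs to output raw pixel coordinates. The mapping from ID to location stays on the agent's side.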