POMDP: Partially Observable Markov Decision Process—a mathematical framework for modeling decision-making where the agent cannot directly observe the full state of the environment
Set-of-Marks: A prompting technique where interactive elements on a screen are overlaid with numeric tags (marks), allowing a vision model to reference specific UI elements by ID
UI Automation tree: A hierarchical representation of user interface elements that the OS exposes to accessibility tools (e.g., screen readers), used here to ground agent observations
DOM: Document Object Model—a tree structure representing the content of a web page
VLM: Vision-Language Model—an AI model capable of understanding and generating content based on both image and text inputs (e.g., GPT-4V)
RGB array: A grid of pixels representing the screen's visual output (Red, Green, Blue channels)
Azure Machine Learning: A cloud service for managing the machine learning lifecycle, used here to orchestrate parallel agent evaluations
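
Illustration (Set-of-Marks): the following is a minimal sketch of the idea in the Set-of-Marks entry above, not the implementation of any particular agent. It assumes the interactive elements and their bounding boxes have already been extracted (for example, from a UI Automation tree or DOM). The names `UIElement`, `assign_marks`, `describe_marks`, and `click_point` are hypothetical, and drawing the tags onto the screenshot is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """Hypothetical record for one interactive element on screen."""
    name: str
    box: tuple  # (left, top, right, bottom) in screen pixels


def assign_marks(elements):
    """Give each element a numeric mark, starting at 1."""
    return {mark_id: el for mark_id, el in enumerate(elements, start=1)}


def describe_marks(marks):
    """Build the text listing that could accompany the tagged screenshot."""
    return "\n".join(f"[{mark_id}] {el.name}" for mark_id, el in marks.items())


def click_point(marks, mark_id):
    """Resolve a mark ID chosen by the model to the center of its bounding box."""
    left, top, right, bottom = marks[mark_id].box
    return ((left + right) // 2, (top + bottom) // 2)


# Assumed example elements; a real agent would read these from the UI tree.
elements = [
    UIElement("Save button", (10, 20, 110, 60)),
    UIElement("Search box", (200, 20, 500, 50)),
]
marks = assign_marks(elements)
print(describe_marks(marks))
print(click_point(marks, 1))  # (60, 40)
```

The model only has to answer with a mark ID such as 1; the mapping back to pixel coordinates stays outside the model.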