GUI: Graphical User Interface—the visual part of a website or app that users interact with
Trajectory: A recorded sequence of an agent's interactions, including observations (screenshots), internal thoughts, and actions taken to solve a task
DOM: Document Object Model, the tree-structured representation of a webpage's content and structure that the browser exposes to scripts
AXTree: Accessibility Tree—a simplified version of the DOM used by screen readers (and agents) that focuses on interactive elements and their semantic roles
VLM: Vision-Language Model—an AI model capable of processing both images (screenshots) and text
Playwright: A software library used to automate web browsers, allowing the agent to programmatically control the browser
SFT: Supervised Fine-Tuning—training a model on a labeled dataset to improve its performance on specific tasks
FastText: A library for efficient text classification and representation learning
RedPajama: A large-scale open-source dataset of text collected from the internet, used here as the source for tutorials
Grounding: The ability of an agent to link abstract concepts (e.g., 'search button') to specific pixel coordinates or code elements on a screen
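
To make a few of these terms concrete, here is a minimal Python sketch of how a trajectory, an AXTree node, and a grounding step might be represented. Every name in it (AXNode, Step, Trajectory, ground) is hypothetical and chosen for illustration only. The word-matching rule is a toy stand-in for grounding, not the method of any particular agent or model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AXNode:
    """One interactive element from a (hypothetical) accessibility tree."""
    role: str                        # semantic role, e.g. "button"
    name: str                        # accessible name, e.g. "Search"
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) in pixels


@dataclass
class Step:
    """One trajectory step: what was seen, what was thought, what was done."""
    screenshot_path: str  # observation
    thought: str          # internal reasoning
    action: str           # action taken, e.g. "click(250, 25)"


@dataclass
class Trajectory:
    """A recorded sequence of steps for a single task."""
    task: str
    steps: List[Step] = field(default_factory=list)


def ground(description: str, nodes: List[AXNode]) -> Optional[Tuple[int, int]]:
    """Toy grounding: map a phrase such as 'search button' to a click point.

    A node matches when every word of the description appears in its role
    or accessible name. Returns the center of the first match, or None.
    """
    words = set(description.lower().split())
    for node in nodes:
        vocabulary = {node.role.lower(), *node.name.lower().split()}
        if words <= vocabulary:
            x, y, w, h = node.bbox
            return (x + w // 2, y + h // 2)
    return None


# Illustrative data: a search box next to a search button.
nodes = [
    AXNode(role="textbox", name="Search", bbox=(10, 10, 200, 30)),
    AXNode(role="button", name="Search", bbox=(220, 10, 60, 30)),
]

point = ground("search button", nodes)
trajectory = Trajectory(task="Search the site")
if point is not None:
    trajectory.steps.append(
        Step(
            screenshot_path="step_0.png",  # placeholder path
            thought="The search button is to the right of the text box.",
            action=f"click({point[0]}, {point[1]})",
        )
    )
```

In practice a VLM would typically do the grounding from the screenshot itself, or the agent would combine the screenshot with AXTree information. The sketch only shows how the pieces defined above relate to one another.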