VLM: Vision-Language Model, a multimodal AI model that processes both images and text to reason and generate outputs
Set-of-Mark (SoM): A prompting technique where interactive elements on a screen are overlaid with numeric tags/bounding boxes to help the model reference specific locations
a11y tree: Accessibility tree, a hierarchical representation of a user interface's structure and text, used by screen readers and often provided to AI agents as a textual view of the screen
Attack Success Rate (ASR): The fraction of attack attempts in which the agent clicks on the malicious pop-up instead of performing the intended task
Attention Hook: A text component of the adversarial pop-up designed to grab the agent's attention, often by raising an alarm or mimicking the user's intent (e.g., 'VIRUS DETECTED' or a summary of the user's query)
Malvertising: The practice of incorporating malware in online advertisements
ALT text: Alternative text, a textual description of an image element in HTML, used here to mislead agents relying on text representations
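
As a concrete reading of the Attack Success Rate definition above, the following minimal sketch computes ASR as the share of episodes in which the agent clicked the pop-up. The function name `attack_success_rate` and the sample outcomes are hypothetical and chosen only for illustration; the source does not specify how episodes are logged or aggregated.

```python
from typing import Sequence


def attack_success_rate(clicked_popup: Sequence[bool]) -> float:
    """Return the fraction of episodes in which the agent clicked the pop-up.

    Each entry is True if the agent clicked the adversarial pop-up in that
    episode and False otherwise (an assumed logging format).
    """
    if not clicked_popup:
        raise ValueError("need at least one episode to compute ASR")
    # True counts as 1 and False as 0, so the sum is the number of clicks.
    return sum(clicked_popup) / len(clicked_popup)


# Hypothetical outcomes for five episodes, not data from the source.
outcomes = [True, False, True, True, False]
print(f"ASR = {attack_success_rate(outcomes):.0%}")
```

Under this reading, an ASR of 0 means the agent never clicked the pop-up and an ASR of 1 means it clicked in every episode.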