GUI: Graphical User Interface. The visual layer (windows, icons, buttons) through which users operate computers and phones
MLLM: Multimodal Large Language Model. An AI model that processes both text and images
SFT: Supervised Fine-Tuning. Further training of a pretrained model on labeled examples
Accessibility Tree: A text-based structural representation of a UI (e.g., HTML DOM or Android View hierarchy) used by screen readers
Set-of-Marks: A technique where numerical IDs are overlaid on image elements to help models reference them
Hierarchical Reasoning: Breaking a task into high-level strategy (planning) and low-level tactics (execution)
Expectation-Reflection: A reasoning pattern where the agent predicts an outcome before acting and evaluates the result afterward
Grounding: Linking a textual reference (e.g., 'Submit button') to the specific on-screen coordinates of the element it names
Trajectory: A sequence of states, actions, and observations recorded during a task interaction
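
To make a few of these terms concrete, the following is a minimal Python sketch of how Set-of-Marks, Grounding, Expectation-Reflection, and Trajectory might fit together as data structures. It is an illustration only: every name here (Mark, ground, Step, Trajectory) and every sample value is a hypothetical assumption, not part of any particular agent framework or of the source being summarized.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Mark:
    """Hypothetical Set-of-Marks entry: a numeric ID overlaid on one UI element."""
    mark_id: int
    label: str                         # e.g. "Submit button"
    bbox: Tuple[int, int, int, int]    # (left, top, right, bottom) in pixels


def ground(label: str, marks: List[Mark]) -> Optional[Tuple[int, int]]:
    """Grounding: resolve a textual label to the center point of its element."""
    for m in marks:
        if m.label.lower() == label.lower():
            left, top, right, bottom = m.bbox
            return ((left + right) // 2, (top + bottom) // 2)
    return None  # label not present on this screen


@dataclass
class Step:
    """One interaction step, recorded with the Expectation-Reflection pattern."""
    observation: str        # what the agent saw before acting
    action: str             # what it did
    expectation: str        # predicted outcome, written before acting
    outcome: str = ""       # actual outcome, filled in afterward

    def reflect(self) -> bool:
        # Simplified reflection: exact string match between prediction and result.
        return self.expectation == self.outcome


@dataclass
class Trajectory:
    """A recorded sequence of steps for one task."""
    task: str
    steps: List[Step] = field(default_factory=list)


# Made-up sample screen with two marked elements.
marks = [
    Mark(1, "Search field", (10, 10, 210, 50)),
    Mark(2, "Submit button", (220, 10, 300, 50)),
]

traj = Trajectory(task="Submit the search form")
step = Step(
    observation="Form with marks 1 and 2 visible",
    action=f"click {ground('Submit button', marks)}",
    expectation="results page",
)
step.outcome = "results page"
traj.steps.append(step)
```

In this sketch the reflection check is a plain string comparison, chosen only to keep the example short; a real agent would more plausibly compare the predicted and observed screens with a model or a structural diff.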