GUI: Graphical User Interface—visual interface of computers/phones involving icons, windows, and menus
VLM: Visual Language Model—AI that understands both images and text
DOM: Document Object Model—structured representation of web pages (HTML tree)
OCR: Optical Character Recognition—converting text in images into machine-readable text
Visual Grounding: Locating specific objects or elements in an image based on a text description
FLOPs: Floating Point Operations—a measure of computational cost
ViT: Vision Transformer—model that processes images as sequences of patches
CogVLM: The base VLM architecture CogAgent is built upon, featuring a 'visual expert' module in the language decoder
EVA2-CLIP: A pre-trained, ViT-based CLIP vision encoder used to extract visual features from images
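
To make the ViT and FLOPs entries concrete, here is a minimal sketch of how the number of patch tokens grows with image resolution, and how that affects the quadratic self-attention term. The image sizes (224 and 1120 pixels) and the patch size (14) are illustrative assumptions rather than values stated in this glossary, and the helper names `num_patches` and `attention_cost_ratio` are hypothetical.

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Count the non-overlapping patch tokens a ViT would see (no class token)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def attention_cost_ratio(tokens_a: int, tokens_b: int) -> float:
    """Relative cost of the quadratic self-attention term for b versus a."""
    return (tokens_b / tokens_a) ** 2


# Assumed example resolutions and patch size, chosen only for illustration.
low = num_patches(224, 224, 14)
high = num_patches(1120, 1120, 14)

print(f"low-res tokens:  {low}")
print(f"high-res tokens: {high}")
print(f"attention cost ratio: {attention_cost_ratio(low, high):.0f}x")
```

Under these assumptions, a 5x increase in side length gives 25x as many patch tokens, and the pairwise attention term grows with the square of that. This is a rough scaling argument only, not a full FLOPs count for any particular model.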