Visual Grounding: The task of locating a specific object or element in an image based on a natural language description
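GRPO: Group Relative Policy Optimization—an RL algorithm that samples a group of outputs for the same prompt and estimates each output's advantage by comparing its reward against the group mean (typically also dividing by the group standard deviation), removing the need for a separate value network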
SFT: Supervised Fine-Tuning—training a model on a fixed dataset of input-output pairs
DOM: Document Object Model—the underlying code structure of a web page, which often contains elements not actually visible to the user
VLM: Vision-Language Model—an AI model capable of understanding and generating content based on both image and text inputs
Dense Point Reward: A continuous reward signal calculated based on the normalized distance between a predicted point and the ground truth center, providing smoother gradients than binary rewards
IoU: Intersection over Union—a metric measuring the overlap between a predicted bounding box and the ground truth box
KL divergence: Kullback-Leibler divergence—an asymmetric measure of how one probability distribution differs from another, used here as a penalty that keeps the RL-tuned model from drifting too far from the original reference model
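
Three of the terms above (IoU, Dense Point Reward, and the group-relative advantage in GRPO) are simple enough to sketch in a few lines. The snippet below is a minimal illustration, not the method of any particular paper: the function names are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels, and the dense reward assumes the distance is normalized by the image diagonal, which the definition above does not specify.

```python
import math
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region; zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def dense_point_reward(pred, gt_box, img_w, img_h):
    """Continuous reward in [0, 1] that decays with distance to the box center.

    Assumption: distance is normalized by the image diagonal. Other
    normalizers (for example the box size) are equally plausible.
    """
    cx = (gt_box[0] + gt_box[2]) / 2
    cy = (gt_box[1] + gt_box[3]) / 2
    dist = math.hypot(pred[0] - cx, pred[1] - cy)
    return max(0.0, 1.0 - dist / math.hypot(img_w, img_h))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward standardized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(dense_point_reward((50, 50), (40, 40, 60, 60), 100, 100))
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Unlike a binary hit-or-miss reward, the dense reward still distinguishes a near miss from a far miss, so outputs in the same group are less likely to receive identical rewards. That matters because identical rewards give every output a group-relative advantage of zero.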