GRPO: Group Relative Policy Optimization—an RL algorithm that estimates the baseline from the mean reward of a group of outputs sampled for the same prompt, rather than using a separate critic model
REC: Referring Expression Comprehension—locating a specific object in an image based on a natural language description
OVD: Open-Vocabulary Object Detection—detecting and classifying objects in an image where the classes are not limited to a predefined set
IoU: Intersection over Union—a metric measuring the overlap between a predicted bounding box and the ground truth box
mAP: mean Average Precision—an object detection metric that averages per-class Average Precision (the area under the precision-recall curve), often further averaged over several IoU thresholds
SFT: Supervised Fine-Tuning—training a model on labeled examples using standard cross-entropy loss
Reward Hacking: When an RL agent exploits loopholes in the reward function to maximize score without solving the underlying task (e.g., predicting too many boxes to game recall)
OD aha moment: An emergent behavior where the model spontaneously generates reasoning steps (thinking about object presence) before predicting bounding boxes, improving accuracy
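
Two of the entries above (IoU and the group-mean baseline in GRPO) are simple enough to sketch in code. The following is a minimal illustration, not an implementation taken from any particular codebase: the function names, the (x1, y1, x2, y2) box format, and the `eps` parameter are assumptions made for this example. Dividing by the group standard deviation follows the commonly described GRPO formulation and is likewise an assumption here, since the definition above mentions only the mean.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for one group of outputs sampled for the same prompt.

    The baseline is the group mean reward, so no critic model is needed.
    Scaling by the group standard deviation is an assumption of this sketch.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards)
    return [(r - baseline) / (scale + eps) for r in rewards]


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square.
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    # Four sampled outputs, two rewarded and two not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Outputs with above-average reward in their group receive positive advantages and the rest receive negative ones, which is what lets the group itself stand in for a learned value estimate.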