RLVR: Reinforcement Learning with Verifiable Rewards—using objective, rule-based checks (like 'is the answer correct?') to guide model training instead of human feedback or static labels
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs for the same input to reduce variance, without needing a separate value function network
IoU: Intersection over Union, a metric for object detection computed as the area of overlap between a predicted bounding box and the ground-truth box divided by the area of their union (0 = no overlap, 1 = perfect match)
Chain of Thought (CoT): A prompting technique where the model is encouraged to generate intermediate reasoning steps before producing the final answer
VQA: Visual Question Answering—the task of answering natural language questions about an image
Visual Grounding: The task of locating an object in an image (usually via bounding box) based on a text description
KL Divergence: Kullback-Leibler Divergence, a measure of how much one probability distribution differs from another (asymmetric, so not a true distance), used here as a penalty to prevent the RL-trained model from drifting too far from the original base model's output distribution
ViT: Vision Transformer—an architecture that processes images as sequences of patches, used here as the visual encoder
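
Two of the terms above, IoU and the group-relative normalization in GRPO, are easier to pin down with a small worked example. The following is a minimal sketch, not the implementation of any particular system. The function names, the (x1, y1, x2, y2) corner format for boxes, the use of the population standard deviation, and the small eps constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped to zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same input.

    Each reward is centered on the group mean and scaled by the group
    standard deviation. Population std and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))

    # Four sampled answers to one prompt, rewarded 1 if correct and 0 otherwise.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In a verifiable-reward setup, an IoU score of this kind could serve directly as the rule-based reward for a grounding task, while the normalized values play the role that a learned value function's baseline would otherwise fill.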