GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a sampled group of outputs for the same input, removing the need for a critic model.
VLLM: Visual Large Language Model—a multimodal model capable of understanding images and generating text or structured outputs.
SAM2: Segment Anything Model 2, a promptable foundation model for image and video segmentation that takes prompts (such as boxes or points) and generates masks.
IoU: Intersection over Union—a metric measuring the overlap between a predicted bounding box/mask and the ground truth.
gIoU: Generalized Intersection over Union, an extension of IoU that stays informative for non-overlapping boxes by penalizing the empty area of the smallest box enclosing both.
RLVR: Reinforcement Learning with Verifiable Rewards—using ground-truth verification (like correct answers in math) to guide RL training.
ECDF: Empirical Cumulative Distribution Function—used here to map raw reward values to their rank/quantile within a history buffer.
Reasoning Segmentation: A task requiring the model to reason about complex instructions to identify and segment a target object.
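
Three of the definitions above (GRPO, IoU/gIoU, and ECDF) are easier to follow with concrete arithmetic. The sketch below is illustrative only and is not the implementation used in this work. The function names are hypothetical, boxes are assumed to be (x1, y1, x2, y2) tuples with positive area, the group normalization is assumed to use the population standard deviation plus a small epsilon, and the history buffer is assumed to be a plain list of past rewards.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: (reward - group mean) / (group std + eps).

    `rewards` holds one scalar per sampled output for the same input.
    Using the population std and `eps` is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def box_iou_and_giou(a, b):
    """Return (IoU, Generalized IoU) for two boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Intersection width/height are clamped at zero for disjoint boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both inputs.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    enclose = cw * ch
    # gIoU subtracts the fraction of the enclosing box not covered by the union.
    giou = iou - (enclose - union) / enclose
    return iou, giou


def ecdf_quantile(value, history):
    """Empirical CDF: fraction of entries in `history` that are <= value."""
    ordered = sorted(history)
    return bisect_right(ordered, value) / len(ordered)
```

For two disjoint boxes, IoU is 0 regardless of how far apart they are, while gIoU becomes more negative as the gap grows; that difference is what makes gIoU usable as a graded signal when the prediction misses the target entirely.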