RLVR: Reinforcement Learning with Verifiable Rewards—using objective outcomes (e.g., correct coordinates) to train models via RL
AEPO: Adaptive Exploration Policy Optimization—the proposed framework combining multi-answer generation, adaptive rewards, and collinear penalties
AER: Adaptive Exploration Reward—a reward function based on efficiency (Utility divided by Cost), incentivizing the model to rank correct answers higher
RLOO: REINFORCE Leave-One-Out—a policy gradient algorithm that reduces variance by using, for each sample, the average reward of the other samples drawn for the same prompt as its baseline
SFT: Supervised Fine-Tuning—training models on labeled examples (input-output pairs) before applying reinforcement learning
Collinearity: A geometric property where points lie on the same straight line; penalized here to force spatial diversity
IoU: Intersection over Union—a metric measuring overlap between predicted and ground-truth bounding boxes
MCTS: Monte Carlo Tree Search—a search algorithm that builds a decision tree incrementally, using randomized rollouts to estimate the value of each action
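
Three of the terms above (RLOO, IoU, Collinearity) reduce to short formulas, so a minimal sketch may make the definitions concrete. This is an illustration only, not the implementation behind AEPO. The function names, the (x1, y1, x2, y2) corner format for boxes, and the tolerance `tol` are assumptions introduced here.

```python
from typing import List, Sequence, Tuple

# Assumed formats (not specified in the glossary):
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 <= x2, y1 <= y2
Point = Tuple[float, float]              # (x, y)


def rloo_advantages(rewards: Sequence[float]) -> List[float]:
    """Leave-one-out advantage: each reward minus the mean of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples")
    total = sum(rewards)
    # Baseline for sample i is (total - r_i) / (k - 1).
    return [r - (total - r) / (k - 1) for r in rewards]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def are_collinear(p: Point, q: Point, r: Point, tol: float = 1e-9) -> bool:
    """True if the three points lie on one line (cross product near zero)."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return abs(cross) <= tol
```

For example, `rloo_advantages([1.0, 0.0, 0.0])` gives the one rewarded sample a positive advantage and the other two a negative one, and `are_collinear((0, 0), (1, 1), (2, 2))` flags three points on the same diagonal. A penalty on such configurations is one way the spatial-diversity idea in the Collinearity entry could be checked.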