Visual-RFT: Visual Reinforcement Fine-Tuning—the proposed method using verifiable rule-based rewards to train LVLMs via reinforcement learning
RFT: Reinforcement Fine-Tuning—fine-tuning models using RL feedback (correct/incorrect) rather than just supervised label imitation
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input to update the policy without a critic model
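For intuition, here is a minimal sketch of the group-relative advantage at the heart of GRPO, assuming one scalar reward per sampled response to the same input; the function name and epsilon are illustrative, not taken from the paper:

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its group's mean and std (GRPO-style).

    All rewards come from responses sampled for the SAME input, so the group
    statistics act as the baseline; no learned critic model is needed.
    """
    mu = sum(rewards) / len(rewards)
    sigma = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one image-question pair, scored by a rule-based reward:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
# Above-average answers get positive advantage, below-average ones negative.
```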
Verifiable Rewards: Reward signals determined by explicit rules (e.g., 'Is IoU > 0.5?') rather than a neural network prediction
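As a concrete (hypothetical) instance, a verifiable reward can be as simple as an exact-match rule on the model's final answer; no reward network is queried:

```python
def accuracy_reward(predicted: str, ground_truth: str) -> float:
    """Rule-based check: 1.0 if the final answer matches the label, else 0.0.

    The rule itself IS the reward; nothing is predicted by a neural network.
    """
    return 1.0 if predicted.strip().lower() == ground_truth.strip().lower() else 0.0

print(accuracy_reward(" Siberian Husky ", "siberian husky"))  # 1.0
```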
LVLM: Large Vision-Language Model—AI models capable of processing and reasoning about both images and text
IoU: Intersection over Union—a standard metric for object detection measuring the overlap between a predicted bounding box and the ground truth
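A short, standard implementation for axis-aligned boxes in (x1, y1, x2, y2) form; a rule like "IoU > 0.5" from the Verifiable Rewards entry can then be checked directly:

```python
def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```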
CoT: Chain-of-Thought—intermediate reasoning steps generated by the model before the final answer
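R1-style RL recipes often wrap the CoT in explicit tags so that a simple rule can separate the reasoning from the final answer before scoring it; the exact tag names below are an assumption for illustration:

```python
import re

# Assumed format: reasoning in <think> tags, final answer in <answer> tags.
SAMPLE = "<think>The target is the red car on the left.</think><answer>[12, 40, 210, 188]</answer>"

def split_cot(text: str) -> tuple[str | None, str | None]:
    """Pull the chain-of-thought and the final answer out of a tagged response."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (think.group(1) if think else None,
            answer.group(1) if answer else None)

cot, ans = split_cot(SAMPLE)  # only `ans` is passed to the rule-based reward
```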
SFT: Supervised Fine-Tuning—training a model to mimic input-output pairs from a dataset
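In its simplest form, SFT minimizes the negative log-likelihood of each labeled output token, i.e., pure imitation of the dataset. A toy single-token sketch (the probability table is made up):

```python
import math

def sft_token_loss(next_token_probs: dict[str, float], target_token: str) -> float:
    """Negative log-likelihood of the labeled next token: what SFT minimizes."""
    return -math.log(next_token_probs[target_token])

# The model is pushed to reproduce the dataset's exact output, token by token,
# with no notion of whether an alternative answer would also have been correct.
print(sft_token_loss({"cat": 0.6, "dog": 0.3, "car": 0.1}, "cat"))  # ≈ 0.511
```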
KL divergence: Kullback-Leibler divergence—a measure of how much one probability distribution differs from another, added as a penalty during RL updates so the fine-tuned policy does not drift too far from the reference model
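For discrete distributions such as a model's next-token probabilities, KL(p || q) is a probability-weighted log-ratio; a minimal sketch:

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support.

    In RL fine-tuning this term is penalized so the updated policy's token
    distribution stays close to the frozen reference model's.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Small divergence between two near-identical next-token distributions:
print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.25, 0.15]))  # ≈ 0.023
```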
mAP: mean Average Precision—a standard object-detection metric: the mean over object classes of Average Precision, where each AP is the area under the precision-recall curve traced out by sweeping the detection confidence threshold
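A simplified (non-interpolated) per-class AP sketch, assuming each detection has already been labeled a true or false positive at some IoU threshold; mAP is then the mean of these AP values over classes:

```python
def average_precision(scored_hits: list[tuple[float, bool]], num_gt: int) -> float:
    """AP for one class: area under the precision-recall curve.

    scored_hits: (confidence, is_true_positive) per detection;
    num_gt: number of ground-truth objects for this class.
    """
    detections = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, is_tp in detections:
        tp, fp = tp + is_tp, fp + (not is_tp)
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle under the PR curve
        prev_recall = recall
    return ap

# Two GT objects; three detections, best-confidence first: hit, miss, hit.
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))  # ≈ 0.833
```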