LVLM: Large Vision-Language Model—an AI model capable of processing both text and images to perform reasoning tasks
V-ToolRL: The authors' proposed reinforcement learning framework designed to teach LVLMs how to use visual tools adaptively
SFT: Supervised Fine-Tuning—training a model on labeled examples (static trajectories) before applying reinforcement learning
GRPO: Group Relative Policy Optimization, an RL algorithm that optimizes a policy by comparing a group of sampled outputs for the same input, estimating each output's advantage from the group's rewards rather than from a separate value model
GroundingDINO: A vision tool that performs open-set object detection based on text queries (finding objects described by text)
SAM: Segment Anything Model—a tool that generates high-quality segmentation masks for objects in an image
OCR: Optical Character Recognition—technology that extracts text from images
Cold-Start: The initial training phase, in which the model undergoes supervised fine-tuning on synthetic data to learn basic tool syntax before RL exploration
Tool Controller: A module in the framework that parses model actions, dispatches requests to distributed tool services, and aggregates results
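
To make the GRPO entry concrete, here is a minimal sketch of the group-relative advantage computation, assuming the standard formulation in which each reward is normalized by the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` constant, and the example rewards are illustrative assumptions, not taken from V-ToolRL.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per output sampled for the SAME input.
    The group statistics stand in for a learned value baseline.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled outputs of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that score above the group mean receive positive advantages and those below receive negative ones, so the policy update needs no separate value model.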