ViTAR: Visual Thinking and Action-centric Reasoning—the proposed framework enabling iterative 'think-act-rethink' cycles in VLMs
SFT: Supervised Fine-Tuning—training the model on labeled step-by-step examples to teach it the structure of the reasoning trajectory
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used to optimize the model's policy by comparing a group of outputs rather than using a separate critic model
ROI: Region of Interest—a specific area within a medical image (e.g., a tumor or lesion) that requires focused analysis
VQA: Visual Question Answering—a task where an AI answers natural language questions based on an image
S0/S1: State vectors in the Markov Decision Process: S0 represents the initial input (image, question) and S1 the intermediate state (input + reasoning + action + feedback)
LLM: Large Language Model—the text-processing backbone of the VLM
Hallucination: Plausible-sounding but factually incorrect model output that is not supported by the image
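
The GRPO entry above describes comparing a group of outputs instead of training a critic. A minimal sketch of that idea follows: each sampled output's reward is normalized against the mean and standard deviation of its own group, and the result serves as the advantage signal. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are assumptions made for illustration; they are not taken from the ViTAR paper or from any specific library.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (hypothetical helper).

    rewards: scalar rewards for several outputs sampled for the SAME prompt.
    eps: small constant (assumed) that avoids division by zero when every
         output in the group received the same reward.
    """
    mu = mean(rewards)        # group baseline, used in place of a critic's value estimate
    sigma = pstdev(rewards)   # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four sampled answers to one question
# (1.0 = judged correct, 0.0 = judged incorrect); these are made-up values.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every output scored the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Outputs that score above their group's mean get a positive advantage and are reinforced; those below get a negative one. Because the baseline comes from the group itself, no separate value network has to be trained.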