GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, avoiding the need for a separate value function critic
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
VQA: Visual Question Answering—a task where a model answers text questions about an image
Think-After: A proposed reasoning protocol in which the model predicts the answer first and then generates the explanation, so the generated reasoning cannot degrade the accuracy of the initial prediction
PPO: Proximal Policy Optimization—a popular RL algorithm that updates policies using a clipped objective function to ensure stability
Hallucination: When a model generates plausible-sounding but factually incorrect information, a common issue in medical CoT when domain knowledge is weak
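
To make the GRPO and PPO entries concrete, here is a minimal Python sketch, not the implementation from any particular paper or library. It assumes scalar rewards for a group of outputs sampled for the same input, and it normalizes with the population standard deviation (implementations differ on this choice). The function names `group_relative_advantages` and `clipped_surrogate`, and the defaults `eps` and `clip_eps=0.2`, are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each output relative to its own group (GRPO-style).

    No learned value function is used: the group mean serves as the baseline.
    Assumption: population std for normalization; eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for a single sample.

    ratio is new_policy_prob / old_policy_prob for the sampled action.
    Taking the minimum removes any gain from moving the ratio outside
    [1 - clip_eps, 1 + clip_eps] in the direction the advantage favors.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical rewards for four outputs sampled for one input.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

With these example rewards, the two correct outputs receive positive advantages and the two incorrect ones receive negative advantages of equal magnitude, without any critic network. GRPO typically plugs such advantages into a clipped objective of the kind shown in `clipped_surrogate`.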