GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages based on the relative performance of a group of outputs rather than using a separate value function network
SFT: Supervised Fine-Tuning—training a model on input-output pairs to mimic the desired behavior
CoT: Chain-of-Thought—a prompting or training technique where models generate intermediate reasoning steps before the final answer
VQA: Visual Question Answering—a task where a model answers natural language questions about an input image
PPO: Proximal Policy Optimization—a widely used RL algorithm that uses a clipped objective to keep policy updates stable; GRPO builds on it and gains efficiency by dropping the separate value function network
OOD: Out-of-Distribution—data that differs significantly from the training data (e.g., testing on X-rays after training on MRIs)
KL divergence: Kullback–Leibler divergence—a measure of how much one probability distribution differs from another, used here as a penalty to prevent the RL-trained model from drifting too far from its original pre-trained state
MRI: Magnetic Resonance Imaging—a medical imaging technique
CT: Computed Tomography—a medical imaging technique using X-rays
Zero-shot: Testing a model on a task or domain it has not explicitly seen during training
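
To make the GRPO, PPO, and KL divergence entries above more concrete, here is a minimal Python sketch of the three quantities they describe. It is illustrative only, not the implementation used in the work this glossary accompanies. The function names, the use of the population standard deviation, the `eps` stabilizer, and the clip range of 0.2 are all assumptions chosen for the example. Practical GRPO implementations typically work per token and often use an approximate KL estimator rather than the exact discrete form shown here.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: score each output against its own group.

    Each reward is normalized by the mean and standard deviation of the
    group of outputs sampled for the same prompt, so no learned value
    network is needed. (Population std and eps are assumptions.)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for a single sample.

    `ratio` is new_policy_prob / old_policy_prob. Clipping the ratio to
    [1 - clip_eps, 1 + clip_eps] and taking the minimum limits how much
    a single update can move the policy.
    """
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


def kl_divergence(p, q):
    """Exact KL(p || q) for two discrete distributions over the same support.

    Assumes q is nonzero wherever p is nonzero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Four sampled answers to one prompt; reward 1.0 marks a correct answer.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
    print(clipped_surrogate(ratio=1.5, advantage=1.0))
    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

In this sketch the correct answers receive positive advantages and the incorrect ones negative, purely from their standing within the group. The KL term would be subtracted from the objective, scaled by a hypothetical coefficient, to keep the policy near its starting point.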