GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same input, removing the need for a learned value function critic
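The group-normalization at the heart of GRPO can be sketched in a few lines: each sampled output's advantage is its reward standardized against the mean and standard deviation of its group, so no learned value critic is required. This is a minimal illustration, not the full GRPO objective (which also includes the clipped policy-ratio term and a KL penalty).

```python
def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one group of sampled outputs.

    Each output's advantage is its reward standardized against the
    group's mean and standard deviation, replacing a learned critic.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = max(var ** 0.5, 1e-8)  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]


# Example: four sampled answers to one question, scored 1 (correct) or 0.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Correct answers get positive advantage, incorrect ones negative.
```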
CoT: Chain-of-Thought—a prompting or training technique where models generate intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs by minimizing the next-token prediction (cross-entropy) loss

Distillation: The process of training a smaller 'student' model to mimic the outputs or reasoning of a larger, often proprietary 'teacher' model (like GPT-4)
Informativeness: A data quality metric defined in this paper, proxied by question length and difficulty; high informativeness (e.g., lengthy clinical vignettes) correlates with better reasoning emergence
MedXpert: A challenging medical QA benchmark focusing on complex clinical reasoning and expert-level decision making
PPO: Proximal Policy Optimization—a standard RL algorithm that uses a value function to estimate advantages and clips updates for stability
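For contrast with GRPO above, PPO's per-token clipped surrogate term can be sketched as follows. This is a simplified scalar illustration of the clipping mechanism only; a full PPO implementation also trains the value function that supplies the advantage estimates.

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """One term of PPO's clipped surrogate objective.

    ratio:     pi_new(a|s) / pi_old(a|s), the policy probability ratio
    advantage: advantage estimate from the learned value function
    eps:       clip range; updates that move the ratio outside
               [1 - eps, 1 + eps] receive no extra gradient signal
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with positive advantage is clipped at 1 + eps,
# capping how far a single update can push the policy.
capped = ppo_clipped_term(ratio=1.5, advantage=1.0)
```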
RLHF: Reinforcement Learning from Human Feedback—aligning models by optimizing against a reward model trained on human preference rankings of outputs