OPG: Oracle Performance Gap—a metric quantifying the performance difference between a model trained on the standard training set and an 'Oracle' model trained directly on the test set
GRPO: Group Relative Policy Optimization—an RL algorithm that computes advantages from rewards normalized within a group of responses sampled for the same prompt, removing the need for a separate learned value model
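The core of the group-relative reward idea can be sketched as follows. This is a minimal illustration of the advantage normalization step only (not a full GRPO training loop); the function name is hypothetical.

```python
def group_relative_advantages(rewards):
    """Normalize a group's rewards to zero mean and unit variance.

    Each element of `rewards` is the scalar reward of one response
    sampled for the same prompt; the normalized value serves as that
    response's advantage estimate.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    eps = 1e-8  # avoids division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary rewards such as `[1.0, 0.0, 1.0, 0.0]`, correct responses receive positive advantages and incorrect ones negative, so the policy is pushed toward the better responses within each group.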
SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs) using standard cross-entropy loss
CoT: Chain-of-Thought—a prompting or training strategy where models generate intermediate reasoning steps before the final answer
pass@1: The percentage of problems where the model's single generated answer matches the ground truth
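As a concrete illustration, pass@1 with a single generation per problem reduces to a simple exact-match accuracy; the helper below is an assumed sketch, not a specific evaluation harness.

```python
def pass_at_1(predictions, references):
    """Percentage of problems whose single predicted answer exactly
    matches the ground-truth answer (after whitespace stripping)."""
    assert len(predictions) == len(references)
    correct = sum(p.strip() == r.strip()
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```

For example, `pass_at_1(["4", "7"], ["4", "8"])` scores 50.0, since only the first answer matches.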
OOD: Out-of-Distribution—data that differs semantically or structurally from the data the model was trained on
counterfactual reasoning: The ability to reason correctly from false or altered premises (e.g., 'If cats could fly...') rather than relying on prior world knowledge
RLHF: Reinforcement Learning from Human Feedback—methods to align LLMs using reward models trained on human preferences
DPO: Direct Preference Optimization—an algorithm that aligns LLMs to preference data directly, without training an explicit reward model
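The standard DPO objective on a single preference pair can be sketched as below: the policy's log-probability ratios against a frozen reference model form an implicit reward margin, and the loss is the negative log-sigmoid of that margin. This is a per-pair sketch of the published loss, not a full training implementation.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    `logp_*` are policy log-probabilities of the two responses;
    `ref_logp_*` are the frozen reference model's log-probabilities.
    """
    # Implicit reward margin: beta * difference of log-ratios vs. reference
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference (margin 0), the loss is log 2; increasing the chosen response's probability relative to the rejected one drives the loss toward 0.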