IPS: Inverse Propensity Score—an evaluation metric that weights each correct prediction by the inverse of the probability of the item being shown, which accounts for varying candidate set sizes (e.g., a correct guess out of 40 options counts for more than a correct guess out of 2)
SFT: Supervised Fine-Tuning—training a model on labeled examples of inputs and desired outputs
DPO: Direct Preference Optimization—a method to align language models to preferences by optimizing the relative likelihood of chosen vs. rejected responses without a separate reward model
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains small rank-decomposition matrices
Reasoning Distillation: Using a larger, more capable model to generate explanations or reasoning steps for a correct answer, then training a smaller model on these explanations
VLM: Visual Language Model—a model that takes images as input and generates text about them, used here to caption artworks
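
As an illustration of the IPS entry above, here is a minimal sketch of an inverse-propensity-weighted accuracy. It rests on assumptions that the glossary does not state: the function name `ips_weighted_accuracy` is hypothetical, each example's propensity is taken to be 1 / (candidate set size), and the score is normalized by the sum of the weights. The metric as actually used may differ in its normalization or in how propensities are estimated.

```python
from typing import Sequence


def ips_weighted_accuracy(correct: Sequence[bool],
                          propensities: Sequence[float]) -> float:
    """Accuracy in which each example is weighted by 1 / propensity.

    Assumption (not from the source): the result is self-normalized,
    i.e. divided by the sum of all weights, so it stays in [0, 1].
    """
    if len(correct) != len(propensities):
        raise ValueError("correct and propensities must have equal length")
    if not correct:
        raise ValueError("at least one example is required")
    if any(p <= 0.0 or p > 1.0 for p in propensities):
        raise ValueError("propensities must lie in (0, 1]")

    # Rarer (lower-propensity) items receive larger weights.
    weights = [1.0 / p for p in propensities]
    weighted_hits = sum(w for w, c in zip(weights, correct) if c)
    return weighted_hits / sum(weights)


# Hypothetical example: one correct guess out of 40 candidates and one
# wrong guess out of 2, with propensity assumed to be 1 / set size.
score = ips_weighted_accuracy([True, False], [1 / 40, 1 / 2])
```

Under these assumptions the correct 40-way guess carries weight 40 and the missed 2-way guess carries weight 2, so the score is 40/42 rather than the unweighted 1/2.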