DPO: Direct Preference Optimization—an offline method that aligns LLMs to preference data directly, without training a separate reward model
SFT: Supervised Fine-Tuning—training a model on high-quality demonstration data before alignment
IPO: Identity Preference Optimization—a DPO variant designed to mitigate overfitting and improve generalization
KTO: Kahneman-Tversky Optimization—an alignment method maximizing utility of generations directly, eliminating the need for paired preferences
PP: Preference Pruning—the authors' proposed method to select data generation parameters based on statistical overlap (BLEU/ROUGE) with reference texts
MT-Bench: A benchmark suite consisting of multi-turn questions across 8 domains (writing, reasoning, math, etc.) evaluated by GPT-4
BLEU: Bilingual Evaluation Understudy—a precision-oriented metric measuring n-gram overlap between a generated text and a reference
ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a metric measuring n-gram overlap, commonly used for summarization
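Since both BLEU and ROUGE reduce to counting shared n-grams, the core of each can be sketched in a few lines. This is a simplified illustration, not the full metrics: real BLEU combines clipped precisions over several n-gram orders with a brevity penalty, and ROUGE has several variants (ROUGE-N, ROUGE-L); the function names here are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision: the core quantity behind BLEU
    (omitting multi-order averaging and the brevity penalty)."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped counts via Counter intersection
    return overlap / max(sum(cand.values()), 1)

def ngram_recall(candidate, reference, n=1):
    """N-gram recall: the core quantity behind ROUGE-N."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())
    return overlap / max(sum(ref.values()), 1)
```

For example, with candidate "the cat sat" and reference "the cat sat down", unigram precision is 3/3 = 1.0 while unigram recall is 3/4 = 0.75, which captures the precision/recall asymmetry between the two metrics.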