DPO: Direct Preference Optimization—an algorithm that fine-tunes a language model directly on preference pairs (winner/loser responses) with a classification-style loss, using the model's own log-probabilities as an implicit reward instead of training a separate reward model network.
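As a rough sketch of what "without a separate reward model" means in practice, the per-pair DPO loss can be written directly in terms of the policy's and a frozen reference model's log-probabilities of the winner and loser responses. The function name, scalar inputs, and the β value below are illustrative assumptions, not part of this glossary:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Per-pair DPO loss.

    pi_w / pi_l:   policy log-prob of the winner / loser response
    ref_w / ref_l: frozen reference-model log-probs of the same responses
    beta:          strength of the implicit KL regularization (assumed value)
    """
    # -log sigmoid(beta * [(log-ratio under policy) - (log-ratio under reference)])
    margin = beta * ((pi_w - pi_l) - (ref_w - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy prefers the winner more strongly than the reference does, the margin is positive and the loss drops below log 2; in a real implementation the log-probs are sums over response tokens and the loss is averaged over a batch.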
LLM-as-a-Judge: Using a language model to evaluate the quality of text responses, typically by prompting it with a scoring rubric to assign a numeric score or to pick the better of two candidates.
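In practice the judge is asked to end its evaluation with a machine-readable verdict that is then parsed out of the free-form rationale. The template wording and the `parse_score` helper below are hypothetical illustrations, assuming a "Score: N" convention:

```python
import re

# Hypothetical judge prompt; the rubric wording is an assumption.
JUDGE_TEMPLATE = (
    "Review the user's question and the response below, and award points\n"
    "from 0 to {max_score} based on relevance and quality.\n"
    "Question: {question}\nResponse: {response}\n"
    "Conclude your evaluation with the line: Score: <points>"
)

def parse_score(judge_output, max_score=5):
    """Extract the numeric verdict from the judge's free-form output."""
    m = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judge_output)
    if not m:
        return None  # judge failed to follow the output format
    return min(float(m.group(1)), float(max_score))
```

Parsing defensively matters because the judge is itself a language model and may omit or malform the final score line.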
IFT: Instruction Fine-Tuning—supervised training on (prompt, response) pairs.
EFT: Evaluation Fine-Tuning—supervised training on (evaluation prompt, evaluation rationale + score) pairs to teach the model how to judge quality.
AIFT: AI Feedback Training—training data created by the model itself, consisting of prompts, generated responses, and self-assigned scores/preferences.
Self-Instruction Creation: The process where the model generates new prompts, then generates candidate responses for them, and finally scores those responses itself.
SFT: Supervised Fine-Tuning—standard training on labeled examples.