DPO: Direct Preference Optimization—a stable method for training language models directly on preference pairs, without fitting an explicit reward model or running a reinforcement learning loop
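The DPO objective can be made concrete with a small sketch. It computes the standard DPO loss for a single preference pair from summed token log-probabilities under the policy and a frozen reference model; the function name, argument names, and the example log-probability values are illustrative, not from any particular implementation.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are summed token log-probabilities of the chosen and
    rejected responses under the current policy and under a frozen
    reference model (usually the SFT checkpoint).
    """
    # Implicit rewards: beta times the log-ratio against the reference
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the reward margin
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs where the policy prefers the chosen
# response more strongly than the reference does, so the loss is small:
loss = dpo_loss(-12.0, -20.0, -14.0, -18.0)
```

Note the reference model appears only through the two log-ratios; this is what removes the separate reward-model training stage that RLHF requires.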
RLHF: Reinforcement Learning from Human Feedback—aligning models by training them to maximize a reward signal derived from human preferences
off-policy: Training a model using data generated by a different model (the behavior policy) rather than the model currently being trained
on-policy: Training a model using data generated by the model itself during the training process
distributional gap: The difference between the statistical distribution of data used for training and the distribution of data the model would naturally generate
AlpacaEval 2: A benchmark for evaluating instruction-following capabilities of LLMs, using an LLM-based judge to compare model outputs against a baseline
MT-bench: A benchmark consisting of multi-turn conversation questions to evaluate LLMs on reasoning, coding, and roleplay
SFT: Supervised Fine-Tuning—the initial training phase where a model learns to follow instructions from labeled examples before RLHF
hybrid RL setting: A training setup that mixes static off-policy preference data with new on-policy samples generated by the current model
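The hybrid setting above can be sketched as a batch-assembly routine that mixes the two data sources. This is a minimal illustration, not a specific published recipe: `hybrid_batch`, `generate_fn`, and the dictionary fields are hypothetical names, and the 50/50 mixing ratio is an arbitrary example.

```python
import random

def hybrid_batch(offline_pairs, generate_fn, batch_size=8, on_policy_frac=0.5):
    """Assemble one training batch that mixes static off-policy
    preference pairs with fresh on-policy samples.

    `offline_pairs` is a list of pre-collected preference records;
    `generate_fn` is a hypothetical callable that, given a prompt,
    samples a new response pair from the current model.
    """
    n_on = int(batch_size * on_policy_frac)
    # Off-policy portion: reuse stored pairs from the behavior policy
    batch = random.sample(offline_pairs, batch_size - n_on)
    # On-policy portion: regenerate responses with the current model,
    # which narrows the distributional gap during training
    prompts = [p["prompt"] for p in random.sample(offline_pairs, n_on)]
    batch += [generate_fn(prompt) for prompt in prompts]
    random.shuffle(batch)
    return batch
```

Raising `on_policy_frac` trades cheap reuse of the static dataset for costlier generation that better matches the model's own output distribution.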