EXRM: Explicit Reward Model—a separate classifier trained on preference data to predict a scalar reward for a prompt-response pair
DPORM: DPO Reward Model—the implicit reward function defined by β times the log-ratio of the DPO-trained policy to the reference policy
DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly on preference data without training a separate reward model
RLHF: Reinforcement Learning from Human Feedback—a three-stage process involving supervised fine-tuning, reward modeling, and reinforcement learning (usually PPO)
OOD: Out-of-Distribution—data samples that differ significantly (in prompt domain or response style) from the training set
Iterative DPO: An alignment strategy in which the model generates new responses, a reward model labels them as preferred or dispreferred, and DPO is re-applied to the newly labeled data, repeating over multiple rounds
AlpacaEval: A benchmark for evaluating instruction-following models by comparing their responses against a reference model (often GPT-4) using an LLM judge
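The DPORM definition above can be made concrete with a minimal sketch: given per-response log-probabilities under the policy and the reference model, the implicit reward is β times their log-ratio, and the DPO training loss is the negative log-sigmoid of the reward margin between the chosen and rejected responses. The function names and the numeric values are illustrative, not from the source.

```python
import math

def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    # DPORM: r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) )
    #               = beta * ( log pi_theta(y|x) - log pi_ref(y|x) )
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_pol_chosen: float, logp_ref_chosen: float,
             logp_pol_rejected: float, logp_ref_rejected: float,
             beta: float = 0.1) -> float:
    # DPO loss: -log sigmoid( r(x, y_chosen) - r(x, y_rejected) )
    margin = (implicit_reward(logp_pol_chosen, logp_ref_chosen, beta)
              - implicit_reward(logp_pol_rejected, logp_ref_rejected, beta))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative log-probs: the chosen response gained probability relative
# to the reference, the rejected one lost probability -> positive margin.
loss = dpo_loss(-10.0, -12.0, -15.0, -13.0, beta=0.5)
```

Note that at inference time, using DPORM as a reward model only requires `implicit_reward`: two forward passes (policy and reference) per candidate response, with no separately trained classifier as in EXRM.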