DPO: Direct Preference Optimization—an algorithm that fine-tunes models on preference pairs directly without training a separate reward model
implicit reward: The mathematical reward value that can be analytically derived from the probability ratios of a DPO-trained policy and its reference model
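These two entries can be made concrete with the standard equations from the DPO paper, where β is the KL-penalty coefficient, π_θ is the policy being trained, and π_ref is the frozen reference model:

```latex
% DPO loss on a preference pair (y_w preferred over y_l)
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
    \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
  \right)\right]

% Implicit reward recoverable from the trained policy
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
```

The second equation is why no separate reward model is needed: the log-probability ratio of the policy against the reference acts as the reward, up to a term that cancels when comparing two responses to the same prompt.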
bootstrapping: A process where a system improves itself using its own previous outputs as training data, without external input
length exploitation: A failure mode where models learn to generate longer text because evaluators (human or model) are biased toward verbosity regardless of quality
experience replay: A technique from continual learning where past training data is mixed with new data to prevent the model from forgetting previously learned information
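A minimal sketch of the mixing step experience replay describes; the helper name, the replay fraction, and the string-based examples are illustrative assumptions, not the source's implementation:

```python
import random

def mix_with_replay(new_data, replay_buffer, replay_fraction=0.25, seed=0):
    """Mix a fraction of past training examples into the new data
    to mitigate forgetting. `replay_fraction` is relative to the
    size of the new data (hypothetical convention)."""
    rng = random.Random(seed)
    n_replay = int(len(new_data) * replay_fraction)
    # Sample without replacement from previously seen data
    replayed = rng.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    mixed = list(new_data) + replayed
    rng.shuffle(mixed)
    return mixed

old_examples = [f"old_{i}" for i in range(100)]
new_examples = [f"new_{i}" for i in range(40)]
batch = mix_with_replay(new_examples, old_examples)  # 40 new + 10 replayed
```

The replay fraction is a tunable knob: too low and the model drifts from earlier capabilities, too high and new preference data is diluted.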
AlpacaEval 2: A benchmark for evaluating instruction-following models using an LLM-based automatic evaluator that corrects for length bias
LC win rate: Length-Controlled win rate—a metric that measures how often a model wins against a baseline while statistically adjusting for the length of responses
reward shaping: Modifying the reward function (in this case, by adding a length penalty) to guide the learning process towards more desirable behaviors
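The length-penalty variant of reward shaping mentioned here can be sketched in a few lines; the function name and the penalty coefficient `alpha` are hypothetical, and a real setup would tune `alpha` and apply this when scoring candidate responses:

```python
def length_penalized_reward(reward, response_tokens, alpha=0.5):
    """Reward shaping: subtract a penalty proportional to response
    length, discouraging verbosity that the raw reward over-credits."""
    return reward - alpha * len(response_tokens)

# A longer response with a slightly higher raw reward can lose
# to a shorter one once the penalty is applied.
short = length_penalized_reward(1.0, ["good", "answer"])          # 1.0 - 1.0 = 0.0
long_ = length_penalized_reward(1.2, ["a"] * 4)                   # 1.2 - 2.0 = -0.8
```

Because DPO only compares rewards within a preference pair, applying the penalty before selecting the chosen/rejected pair is enough to shift the training signal away from length.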
SFT: Supervised Fine-Tuning—the initial phase of training where a model learns to follow instructions from labeled examples
Zephyr: A specific series of language models aligned using DPO, used here as a base model
Llama-3: A family of open-weights large language models developed by Meta