DPO: Direct Preference Optimization—an algorithm that fine-tunes language models directly on preference data, without training a separate reward model in the loop
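A minimal sketch of the DPO pairwise loss for a single preference pair, assuming the log-probabilities of the chosen and rejected responses under the policy and the reference model are already computed (function and argument names here are illustrative, not from the source):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid of the implicit reward margin (single pair)."""
    # Implicit rewards are beta-scaled log-ratios against the reference model
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)): small when the chosen response is favored
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; training pushes the chosen response's log-ratio above the rejected one's.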
Online DPO: A variant of DPO where the model generates its own training data (responses) during training, which are then scored/labeled, reducing distribution shift compared to offline data
Pareto frontier: The set of optimal trade-offs between conflicting objectives where no objective can be improved without degrading another
Dirichlet sampling: Drawing from a Dirichlet distribution—a distribution over the probability simplex—used here to sample objective-weight vectors (non-negative entries summing to one) during training, ensuring diverse coverage of the trade-off space
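A small sketch of Dirichlet sampling using only the standard library, via the standard normalized-Gamma construction (the concentration values here are illustrative):

```python
import random

def sample_dirichlet(alpha):
    """Draw one weight vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# alpha = [1, 1, 1] gives a uniform distribution over the 2-simplex,
# i.e. all valid 3-objective weight combinations are equally likely
weights = sample_dirichlet([1.0, 1.0, 1.0])
```

Each sampled vector is non-negative and sums to one, so it can be fed directly to the model as an objective-weighting control signal.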
Steerable policy: A single model capable of adjusting its behavior at inference time based on an input control signal (like a weight vector)
Model souping: A technique of averaging the weights of multiple models trained on different objectives to create a new model that balances those objectives
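A minimal sketch of model souping, assuming the models share an identical architecture so their parameters can be averaged key by key (plain dicts of floats stand in for real checkpoints here):

```python
def soup(state_dicts):
    """Uniformly average parameters across checkpoints with identical keys."""
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts) for k in keys}

# Two single-parameter "models" trained on different objectives
model_a = {"layer.weight": 1.0}
model_b = {"layer.weight": 3.0}
souped = soup([model_a, model_b])  # {"layer.weight": 2.0}
```

With real checkpoints the same loop runs over tensors (e.g. PyTorch state dicts), averaging elementwise; only models fine-tuned from a common initialization tend to soup well.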
KL regularization: A penalty term that keeps the trained model from diverging too far from a reference model (typically the supervised fine-tuned checkpoint the training started from)
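A hedged sketch of the per-sequence KL estimate commonly used as this penalty: the average per-token log-ratio between the policy and the reference model on a sampled response (the inputs here are assumed precomputed token log-probabilities, not from the source):

```python
def kl_penalty(policy_token_logps, ref_token_logps):
    """Monte Carlo KL estimate: mean of log pi_theta(y_t) - log pi_ref(y_t)."""
    ratios = [lp - lr for lp, lr in zip(policy_token_logps, ref_token_logps)]
    return sum(ratios) / len(ratios)

# Policy assigns higher probability than the reference -> positive penalty
penalty = kl_penalty([-1.0, -2.0], [-1.5, -2.5])  # 0.5
```

This estimate is added to the training objective scaled by a coefficient (the beta in DPO plays the analogous role implicitly), so larger divergence from the reference costs more.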