MOD: Multi-Objective Decoding—the proposed algorithm, which combines the predictions of multiple base models at inference time to trade off several alignment objectives
Legendre transform: A mathematical operation that relates a function to its convex conjugate; used here to map between reward space and policy space
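As a small illustration (not taken from the paper), the Legendre transform of f is f*(y) = sup_x (xy − f(x)). For f(x) = x log x, the generator of the KL divergence, the conjugate is f*(y) = e^(y−1); a quick numerical check over a grid:

```python
import math

def conjugate(f, y, xs):
    """Numerically approximate the Legendre transform f*(y) = sup_x (x*y - f(x))."""
    return max(x * y - f(x) for x in xs)

f = lambda x: x * math.log(x)              # generator of the KL divergence
xs = [i / 1000 for i in range(1, 20001)]   # grid over (0, 20]

for y in [0.0, 1.0, 2.0]:
    approx = conjugate(f, y, xs)
    exact = math.exp(y - 1.0)              # closed-form conjugate of x log x
    print(f"y={y}: numeric={approx:.4f}, exact={exact:.4f}")
```

The supremum is attained at x = e^(y−1), which is the sense in which the transform maps a reward-space object to a (policy-like) exponential form.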
PPO: Proximal Policy Optimization—a reinforcement-learning algorithm widely used for RLHF-style alignment
DPO: Direct Preference Optimization—an algorithm optimizing a policy to satisfy preferences without an explicit reward model
strong-barrier function: A regularizing function (e.g., x log x, the generator of the KL divergence) that is continuously differentiable and strongly convex, enabling a bijective mapping between rewards and policies
Pareto frontier: The set of optimal solutions where no objective can be improved without degrading another
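A hypothetical sketch of extracting the Pareto frontier from a set of (reward_1, reward_2) pairs, where higher is better on both axes; the candidate values are illustrative, not from the paper:

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point (maximize both coordinates)."""
    frontier = []
    for p in points:
        # p is dominated if some other point is at least as good on both objectives
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

# E.g., trade-offs between two reward objectives:
candidates = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4), (0.2, 0.3)]
print(pareto_frontier(candidates))  # → [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]
```

Here (0.4, 0.4) and (0.2, 0.3) drop out because (0.5, 0.5) beats them on both objectives; no surviving point can improve one objective without degrading the other.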
SFT: Supervised Fine-Tuning—the initial training phase of a model on instruction data
logit: The raw, unnormalized output scores of a neural network before the softmax layer
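A minimal sketch of how logits become next-token probabilities via softmax (the max-subtraction trick is a standard numerical-stability convention, not specific to the paper):

```python
import math

def softmax(logits):
    """Map raw, unnormalized logits to a probability distribution."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])          # probabilities sum to 1
```

Note that the ordering of the logits is preserved: the largest logit always yields the largest probability.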
f-divergence: A family of measures quantifying the difference between two probability distributions (includes KL, Reverse KL, etc.)
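As an illustrative sketch, a discrete f-divergence is D_f(P‖Q) = Σ_x Q(x) f(P(x)/Q(x)); the standard generators f(t) = t log t and f(t) = −log t recover forward KL and reverse KL respectively (generators assumed from the textbook definition, not quoted from the paper):

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete distributions."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

kl_gen     = lambda t: t * math.log(t)   # forward KL generator
rev_kl_gen = lambda t: -math.log(t)      # reverse KL generator

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print("KL(P||Q)        =", f_divergence(p, q, kl_gen))
print("ReverseKL(P||Q) =", f_divergence(p, q, rev_kl_gen))
```

With f(t) = t log t the sum reduces to Σ p log(p/q), the familiar KL divergence; every f-divergence is zero exactly when P = Q.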