DPO: Direct Preference Optimization—a method that aligns language models with preferences directly from preference data, without training a separate reward model
DAG: Directed Acyclic Graph—a graph with directed edges and no cycles, used here to model the flow of data between AI components
Compound AI System: A system composed of multiple interacting AI models (e.g., an LLM calling a diffusion model or another LLM)
SysDPO-Direct: A variant of the proposed framework that assumes intermediate outputs are observed in the preference dataset
SysDPO-Sampling: A variant that samples intermediate outputs (using beam search) during training when they are not provided in the dataset
DBS: Diverse Beam Search—a decoding algorithm that encourages diversity among generated candidates
beta-perfect alignment: A theoretical state where the model's likelihood ratios perfectly match the oracle's preference probabilities (scaled by temperature beta)
Bradley-Terry model: A statistical model in which the probability that one item is preferred over another depends on the difference of their underlying reward scores
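The last two entries have concrete standard formulas; a minimal sketch may make them explicit. Under the Bradley-Terry model, the preference probability is the sigmoid of the reward difference, and the standard per-pair DPO loss is the negative log-sigmoid of the (beta-scaled) difference of policy-vs-reference log-ratios for the chosen and rejected responses. The function names below are illustrative, not from the paper:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry_prob(r_w: float, r_l: float) -> float:
    """P(w preferred over l) = sigma(r_w - r_l) under Bradley-Terry."""
    return sigmoid(r_w - r_l)

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard per-pair DPO loss:
    -log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

# Equal rewards give a 50/50 preference; a zero margin gives loss log(2).
print(bradley_terry_prob(1.0, 1.0))          # 0.5
print(dpo_loss(-2.0, -3.0, -2.0, -3.0))      # ~0.6931 (log 2)
```

Increasing beta sharpens the implied preference distribution: the same log-ratio margin is mapped further from 0.5, which is the sense in which beta acts as an inverse temperature in the "beta-perfect alignment" entry above.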