Preference Tree: A data structure where an instruction is the root, and branches represent different reasoning attempts (correct and incorrect), enabling preference learning at multiple turns.
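A minimal sketch of such a structure, assuming a simple recursive node layout; the class and function names (`Node`, `PreferenceTree`, `preference_pairs`) are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One reasoning attempt (an action) at a given turn."""
    text: str
    correct: bool
    children: list = field(default_factory=list)  # follow-up attempts

@dataclass
class PreferenceTree:
    """Root holds the instruction; each tree depth is one interaction turn."""
    instruction: str
    attempts: list = field(default_factory=list)  # top-level Nodes

def preference_pairs(nodes, prefix=()):
    """Yield (chosen, rejected) trajectory pairs at every turn:
    a correct attempt is preferred over an incorrect sibling."""
    for good in nodes:
        for bad in nodes:
            if good.correct and not bad.correct:
                yield (prefix + (good.text,), prefix + (bad.text,))
    for n in nodes:
        yield from preference_pairs(n.children, prefix + (n.text,))
```

Walking the tree this way turns one instruction into many preference pairs, one set per turn, which is what makes multi-turn preference learning possible.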
DPO: Direct Preference Optimization—an algorithm that optimizes the policy to satisfy pairwise preferences directly, without training an explicit reward model. This work finds that it hurts performance on reasoning tasks.
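For reference, the standard DPO objective for a single preference pair can be sketched as below; the inputs are sequence log-likelihoods under the policy and the frozen reference model, and the function name and default `beta` are illustrative:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio
    minus rejected log-ratio)), where each log-ratio is
    log pi(y|x) - log pi_ref(y|x)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree on both responses the margin is zero and the loss sits at log 2; increasing the chosen response's likelihood relative to the rejected one drives the loss down.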
KTO: Kahneman-Tversky Optimization—a preference learning method that aligns models using unpaired binary feedback (good/bad) rather than paired comparisons.
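A simplified sketch of the KTO idea, under stated assumptions: the full method estimates the reference point `z0` as a batch-level KL term and weights desirable and undesirable examples separately, both of which are fixed here for illustration; the function name is hypothetical:

```python
import math

def kto_loss(logp, ref_logp, desirable, beta=0.1, z0=0.0):
    """Per-example loss from a single unpaired binary label.
    r = beta * (log pi(y|x) - log pi_ref(y|x)) is the implicit reward;
    desirable examples are pushed above the reference point z0,
    undesirable examples below it."""
    r = beta * (logp - ref_logp)
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    return 1.0 - sigmoid(r - z0) if desirable else 1.0 - sigmoid(z0 - r)
```

The key contrast with DPO is visible in the signature: each example carries only its own log-probabilities and a good/bad label, with no paired rejected response.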
NCA: Noise Contrastive Alignment—a method that aligns models by contrasting the probability of the chosen response against a noise distribution.
SFT: Supervised Fine-Tuning—training the model on high-quality demonstration data before alignment.
Bradley-Terry (BT) Model: A standard statistical model for estimating the probability that one item is preferred over another, used in training reward models.
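The BT model reduces to a sigmoid over the score difference, which is exactly the form used in reward-model training; a minimal sketch (function names illustrative):

```python
import math

def bt_prob(score_i, score_j):
    """Bradley-Terry: P(i preferred over j) = sigmoid(score_i - score_j)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

def rm_loss(score_chosen, score_rejected):
    """Reward-model training loss: negative log-likelihood of the
    observed preference under the BT model."""
    return -math.log(bt_prob(score_chosen, score_rejected))
```

Equal scores give a 50/50 preference, and the loss pushes the reward model to score chosen responses above rejected ones.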
UltraInteract: The newly curated dataset in this paper containing 220K interaction trajectories structured as preference trees.