SFT: Supervised Fine-Tuning—training a model on labeled examples (prompts and answers) to learn to follow instructions
RLHF: Reinforcement Learning from Human Feedback—a multi-step alignment process using a reward model to guide the language model
DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly on preference pairs relative to a reference model
ORPO: Odds Ratio Preference Optimization—the proposed method that combines SFT and preference alignment into one step using an odds ratio penalty
Odds Ratio: A statistic quantifying how much more likely the model is to generate the chosen response compared to the rejected response
Reference Model: A frozen copy of the pre-trained or SFT model used in DPO/RLHF to prevent the active model from drifting too far (KL divergence constraint)
NLL: Negative Log-Likelihood—the standard loss function in language modeling, minimized during training to maximize the probability of the correct next token
Monolithic: Refers to a single, unified training phase rather than a multi-stage pipeline
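The odds ratio, NLL, and monolithic ORPO objective defined above can be tied together in a short sketch. This is an assumed single-example form (sequence probabilities `p_chosen` and `p_rejected`, weight `lam` are illustrative names, not from the source): the loss is the NLL of the chosen response plus an odds-ratio penalty, so SFT and preference alignment happen in one step with no reference model.

```python
import math

def odds(p: float) -> float:
    # Odds of generating a response with probability p: p / (1 - p).
    return p / (1.0 - p)

def orpo_loss(p_chosen: float, p_rejected: float, lam: float = 0.1) -> float:
    """Sketch of a monolithic ORPO-style objective for one preference pair.
    p_chosen / p_rejected: model probabilities of the chosen and rejected
    responses (0 < p < 1). lam weights the odds-ratio penalty."""
    # SFT term: standard NLL on the chosen response.
    nll = -math.log(p_chosen)
    # Log odds ratio: how much more likely the chosen response is,
    # in odds terms, than the rejected one.
    log_odds_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    # Penalty: -log sigmoid of the log odds ratio; shrinks as the
    # model prefers the chosen response more strongly.
    penalty = -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))
    return nll + lam * penalty
```

Raising the rejected response's probability while holding the chosen one fixed increases the loss, which is the behavior the odds-ratio penalty is meant to enforce.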