DAP: Direct Alignment from Preferences—a family of methods (like DPO, IPO, SLiC) that optimize a policy directly from preference data without a separate reward model
OAIF: Online AI Feedback—the proposed method where preferences are generated on-the-fly by an LLM annotator for the model's own outputs
DPO: Direct Preference Optimization—a specific DAP algorithm that optimizes a loss derived from the closed-form optimal policy of the KL-regularized reward-maximization objective
RLHF: Reinforcement Learning from Human Feedback—the standard alignment pipeline that trains a reward model on human preference data and then optimizes the policy against it with PPO
RLAIF: Reinforcement Learning from AI Feedback—similar to RLHF but uses an AI model instead of humans to generate the feedback/preferences
on-policy: Learning from data generated by the current version of the model being trained (as opposed to old or static data)
off-policy: Learning from data generated by a different policy (e.g., a static dataset collected before training started)
SFT: Supervised Fine-Tuning—the initial training phase where a model learns to mimic high-quality demonstrations
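To make the DPO entry above concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes the per-response summed token log-probabilities under the policy and the reference (SFT) model are already computed; the function name and arguments are illustrative, not from any particular library.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Loss = -log sigmoid(beta * (margin_chosen - margin_rejected)),
    where each margin is the policy-vs-reference log-probability ratio.
    beta controls the strength of the implicit KL constraint to the
    reference model.
    """
    margin_chosen = policy_logp_chosen - ref_logp_chosen
    margin_rejected = policy_logp_rejected - ref_logp_rejected
    logits = beta * (margin_chosen - margin_rejected)
    # -log(sigmoid(logits)) written as log1p(exp(-logits)) for stability
    return math.log1p(math.exp(-logits))
```

When the policy equals the reference, both margins vanish and the loss sits at log 2; pushing probability mass toward the chosen response over the rejected one drives the loss below that baseline.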