LPF: Learning from Pairwise Feedback—the process of training models using binary preference data (A vs. B) rather than scalar rewards
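Pairwise feedback is typically turned into a training signal via a Bradley-Terry-style loss on reward-model scores. A minimal sketch (the function name and scalar-reward framing are illustrative, not from the source):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood that the chosen response
    is preferred over the rejected one, given scalar reward scores."""
    # P(chosen > rejected) = sigmoid(reward_chosen - reward_rejected)
    prob_chosen = 1.0 / (1.0 + math.exp(reward_rejected - reward_chosen))
    return -math.log(prob_chosen)
```

The loss shrinks as the reward model scores the preferred response higher, so minimizing it over many (A, B) comparisons fits a scalar reward to binary preferences.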
SFT: Supervised Fine-Tuning—training a model on high-quality instruction-response pairs before applying reinforcement learning
PPO: Proximal Policy Optimization—an RL algorithm used to update the language model policy to maximize reward while remaining close to the initial policy
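The "remaining close to the initial policy" part of PPO comes from its clipped surrogate objective. A per-token sketch under simplified assumptions (single scalar advantage, log-probabilities already computed; the function name is illustrative):

```python
import math

def ppo_clip_objective(logp_new: float, logp_old: float,
                       advantage: float, eps: float = 0.2) -> float:
    """Clipped PPO surrogate for one action: the probability ratio is
    clipped to [1 - eps, 1 + eps] so a single update cannot move the
    policy too far from the one that generated the data."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

Maximizing this objective (in practice, averaged over a batch and combined with a KL penalty to the SFT policy) is what updates the language model.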
Best-of-n: An inference-time method that generates 'n' samples and selects the one with the highest predicted reward from a reward model
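The best-of-n procedure is simple enough to state directly. A minimal sketch, with a toy generator and reward model standing in for the real policy and learned reward model (both stand-ins are hypothetical):

```python
from typing import Callable

def best_of_n(prompt: str, n: int,
              generate: Callable[[str], str],
              reward: Callable[[str], float]) -> str:
    """Sample n candidate responses and return the one the
    reward model scores highest. No policy update is performed."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

def toy_reward(response: str) -> float:
    # Stand-in for a learned reward model: here, longer is "better".
    return float(len(response))
```

Because selection happens purely at inference time, best-of-n trades extra sampling compute for quality without any fine-tuning.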
Expert Iteration: A learning method where a model generates data, high-quality samples are selected (filtered), and the model is fine-tuned on those selected samples
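One round of the generate-filter-fine-tune loop can be sketched as follows (the threshold-based filter and function names are illustrative assumptions; real implementations may instead keep the top-k samples per prompt):

```python
from typing import Callable, List, Tuple

def expert_iteration_round(prompts: List[str],
                           generate: Callable[[str], str],
                           reward: Callable[[str], float],
                           threshold: float) -> List[Tuple[str, str]]:
    """Generate one sample per prompt, score each with the reward
    model, and keep only samples that clear the threshold. The kept
    (prompt, response) pairs become the next fine-tuning dataset."""
    samples = [(p, generate(p)) for p in prompts]
    return [(p, s) for p, s in samples if reward(s) >= threshold]
```

The fine-tuning step itself is omitted here; iterating this round lets the model bootstrap from its own best outputs.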
Oracle API LLMs: Large, high-capability models (like GPT-4) accessed via API, used here to simulate human judgment
Win-rate: The percentage of times a model's output is preferred over a reference model's output (usually Davinci003) in a pairwise comparison
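Computed over a fixed evaluation set, the metric is a straightforward average of pairwise judgments. A minimal sketch (the judge callable stands in for the human or simulated annotator):

```python
from typing import Callable, List

def win_rate(model_outputs: List[str],
             reference_outputs: List[str],
             prefers_model: Callable[[str, str], bool]) -> float:
    """Percentage of paired comparisons in which the judge prefers
    the model's output over the reference model's output."""
    wins = sum(1 for m, r in zip(model_outputs, reference_outputs)
               if prefers_model(m, r))
    return 100.0 * wins / len(model_outputs)
```

A win-rate of 50% means the model is on par with the reference; in practice ties are often counted as half a win.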
Davinci003: A specific version of OpenAI's GPT-3 model optimized for instruction following, used as a reference baseline
Alpaca data: A dataset of 52k instruction-following examples generated by text-davinci-003, used for initial SFT
Simulated Annotator: An LLM prompted to act as a human labeler, including specific noise and bias characteristics to match human behavior
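The noise-injection idea can be sketched as a wrapper around an LLM judge. Everything here is illustrative: `query_llm` is a hypothetical callable for the oracle API, the prompt template is a simplification, and the flip probability is a placeholder rather than a value from the source:

```python
import random

PROMPT_TEMPLATE = (
    "Instruction: {instruction}\n"
    "Output (A): {a}\n"
    "Output (B): {b}\n"
    "Which output is better, A or B?"
)

def simulated_annotate(instruction: str, a: str, b: str,
                       query_llm, flip_prob: float = 0.25,
                       rng=random) -> str:
    """Ask an LLM judge for a pairwise preference, then randomly
    flip the label so the simulated annotator's agreement rate
    mimics the noise observed in real human labels."""
    choice = query_llm(PROMPT_TEMPLATE.format(instruction=instruction, a=a, b=b))
    if rng.random() < flip_prob:
        choice = "B" if choice == "A" else "A"
    return choice
```

Randomizing the (A, B) presentation order per query is also common, to counteract the judge's position bias.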