RLHF: Reinforcement Learning from Human Feedback—a method for aligning models by optimizing them against a reward model trained on human preference data
RLAIF: Reinforcement Learning from AI Feedback—similar to RLHF but uses an AI system to provide the preference labels instead of humans
DPO: Direct Preference Optimization—a simpler, more stable alternative to RLHF that optimizes the policy directly on preference data, without training a separate reward model
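For concreteness, the DPO objective for a single preference pair can be sketched in plain Python (a minimal scalar sketch; the function name and scalar log-probabilities are illustrative, and real implementations batch this over token sequences):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (toy scalar version).

    logp_* are the policy's total log-probabilities of the chosen and
    rejected responses; ref_logp_* come from a frozen reference model.
    beta controls how strongly the policy may deviate from the reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): shrinks as the policy favors the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree, the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the chosen response, the loss falls.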
PPO: Proximal Policy Optimization—a standard RL algorithm used to update model weights based on reward scores while preventing destructive large updates
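The "preventing destructive large updates" part is PPO's clipped surrogate objective, sketched here for a single sample (names are illustrative; real implementations average this over a batch and add value and entropy terms):

```python
def ppo_clipped_objective(ratio: float, advantage: float,
                          eps: float = 0.2) -> float:
    # ratio = pi_new(a|s) / pi_old(a|s) for one sampled action.
    # Clipping the ratio to [1 - eps, 1 + eps] caps how far a single
    # gradient step can move the policy away from the old one.
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # PPO maximizes the pessimistic (smaller) of the two estimates
    return min(unclipped, clipped)
```

With eps = 0.2, a sample whose ratio has already drifted to 2.0 contributes no more than a ratio of 1.2 would, so further updates in that direction are not rewarded.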
CoT: Chain-of-Thought—a prompting strategy where models generate intermediate reasoning steps
SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs before applying RL
Reward Model: A separate neural network trained to predict human preference scores for a given response
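Reward models of this kind are typically trained with a Bradley-Terry pairwise loss on preference pairs; a minimal sketch, assuming the model has already produced scalar scores for the two responses:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the score of the human-preferred response
    # above the score of the rejected one.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Only the score difference matters, so reward scales are relative: adding a constant to both scores leaves the loss unchanged.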
Cold Start: The initial phase of training (often using high-quality SFT data) to prepare a model for effective reinforcement learning
Policy: The LLM itself, viewed as an agent that decides which token (action) to generate next given the context (state)