RLHF: Reinforcement Learning from Human Feedback—a method to fine-tune language models using rewards derived from human preferences
PPO: Proximal Policy Optimization—an RL algorithm used to update the language model policy while preventing it from changing too drastically
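The "preventing it from changing too drastically" part of PPO comes from its clipped surrogate objective. A minimal sketch of that objective for a single action, with illustrative names (`ppo_clip_objective` is not from any particular library):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO surrogate for one action (to be maximized)."""
    # Probability ratio between the updated policy and the one
    # that generated the data.
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps]; taking the min keeps
    # the update conservative in the direction of the advantage.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

With an unchanged policy the objective equals the advantage; once the ratio exceeds 1 + eps, further increases stop improving the objective, which is what discourages large policy jumps.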
SFT: Supervised Fine-Tuning—the first stage of alignment where the model learns to mimic high-quality human demonstrations
RM: Reward Model—a model trained to predict which of two responses a human would prefer, used to guide the RL phase
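Reward models are typically trained with a pairwise (Bradley-Terry-style) loss on the scores of the chosen and rejected responses. A minimal sketch, with an illustrative function name:

```python
import math

def rm_pairwise_loss(score_chosen, score_rejected):
    """-log sigmoid(s_chosen - s_rejected) for one preference pair."""
    margin = score_chosen - score_rejected
    # Loss is small when the chosen response outscores the rejected
    # one, and large when the ranking is inverted.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two scores are equal the loss is log 2; it decreases monotonically as the chosen response's margin grows.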
DPO: Direct Preference Optimization—an alternative to PPO that optimizes the policy directly on preference data without an explicit reward model
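DPO folds the reward model away by treating the policy's log-probability ratio against the reference model as an implicit reward. A sketch of the per-pair DPO loss over summed sequence log-probs (function name illustrative; `beta` is the usual temperature hyperparameter):

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy,
             logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for one (chosen w, rejected l) preference pair."""
    # Implicit rewards: beta-scaled log-prob ratios vs. the reference.
    r_w = beta * (logp_w_policy - logp_w_ref)
    r_l = beta * (logp_l_policy - logp_l_ref)
    # Logistic loss on the reward margin, as in reward-model training.
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))
```

When the policy equals the reference, the loss sits at log 2; raising the chosen response's likelihood relative to the reference lowers it.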
KL penalty: Kullback-Leibler divergence penalty—a regularizer added to the reward to ensure the RL-tuned model doesn't drift too far from the original reference model
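In RLHF the KL penalty is commonly applied per token, using the log-prob difference between the policy and the reference model as a KL estimate subtracted from the reward. A minimal sketch under that assumption (names illustrative):

```python
def kl_shaped_reward(rm_reward, logp_policy, logp_ref, kl_coef=0.1):
    """Reward-model score minus a per-token KL penalty estimate."""
    # log pi_policy(token) - log pi_ref(token) estimates the KL
    # contribution of this token; positive when the policy drifts.
    kl_estimate = logp_policy - logp_ref
    return rm_reward - kl_coef * kl_estimate
```

If the policy matches the reference, the shaped reward is just the reward-model score; the more the policy over-weights a token relative to the reference, the more reward is taken away.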
EOS token: End of Sequence token—a special token indicating the end of a generation
ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics for evaluating automatic summarization by comparing generated summaries to reference summaries
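As a concrete instance, ROUGE-1 recall is the fraction of reference unigrams that also appear in the candidate summary (with counts clipped). A simplified sketch without the stemming and tokenization that full ROUGE implementations apply:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Clipped unigram overlap divided by reference length."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each reference token counts at most as often as it occurs
    # in the candidate.
    overlap = sum(min(cand[tok], n) for tok, n in ref.items())
    return overlap / max(sum(ref.values()), 1)
```

A candidate identical to the reference scores 1.0; a candidate sharing no tokens scores 0.0.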