RLHF: Reinforcement Learning from Human Feedback—a method to train language models using human preferences as a reward signal
PPO: Proximal Policy Optimization—an RL algorithm used to update the language model policy based on rewards
Fine-grained reward: A reward signal provided for specific segments (e.g., sentences) or specific error types, rather than a single score for the whole output
Holistic reward: A single scalar score representing the overall quality of an entire generated sequence
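The contrast between the two reward types above can be sketched in a few lines of Python. The function names and scoring rules here are hypothetical placeholders (not from any specific library): a holistic reward returns one scalar for the whole output, while a fine-grained reward returns one score per sentence segment.

```python
def holistic_reward(output: str) -> float:
    # One scalar for the entire sequence (e.g., from a preference reward model).
    return 0.8  # placeholder value for illustration

def fine_grained_reward(output: str) -> list[float]:
    # One score per sentence-level segment: credit assignment becomes denser,
    # since each segment can be rewarded or penalized individually.
    segments = [s for s in output.split(". ") if s]
    return [0.9 if "fact" in s else 0.2 for s in segments]  # placeholder rule

out = "The sky is a fact. Unsupported claim here."
print(holistic_reward(out))      # single scalar for the whole output
print(fine_grained_reward(out))  # one score per segment
```

The practical difference is credit assignment: with a holistic reward, the RL algorithm must infer which part of a long output caused a low score, whereas fine-grained rewards localize the signal.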
RougeLSum: An automatic metric for evaluating text generation based on the longest common subsequence (LCS) between generated and reference text; the "Sum" variant computes ROUGE-L per sentence (split on newlines) and aggregates the scores
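A minimal sketch of the LCS-based ROUGE-L F-score that underlies RougeLSum (the full metric additionally splits on newlines, scores each sentence, and aggregates; tokenization here is naive whitespace splitting for illustration):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # Standard dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f(candidate: str, reference: str) -> float:
    # F-measure combining LCS-based precision and recall.
    c, r = candidate.split(), reference.split()
    l = lcs_len(c, r)
    if l == 0:
        return 0.0
    precision, recall = l / len(c), l / len(r)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f("the cat sat on the mat", "the cat is on the mat"))
```

Because LCS preserves word order without requiring contiguous matches, ROUGE-L rewards sequences that follow the reference's structure even when individual words differ.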
Perplexity (PPL): A measurement of how well a probability model predicts a sample; lower values indicate the text is more 'natural' or predictable to the model
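Perplexity can be computed directly from per-token log-probabilities as the exponentiated negative mean log-likelihood; a small self-contained sketch (the log-probability values are made up for illustration):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # PPL = exp(-(1/N) * sum(log p(token_i))); lower means the model
    # finds the text more predictable.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A sequence the model finds likely (each token has probability 0.9)
likely = [math.log(0.9)] * 5
# A sequence the model finds surprising (each token has probability 0.1)
surprising = [math.log(0.1)] * 5

print(perplexity(likely))      # ~1.11
print(perplexity(surprising))  # 10.0
```

Intuitively, a perplexity of 10 means the model is, on average, as uncertain at each step as if it were choosing uniformly among 10 tokens.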
SFT: Supervised Fine-Tuning—training the model on high-quality demonstration data before applying RL
KL divergence penalty: A term added to the reward function to prevent the RL-trained model from deviating too far from the reference model (typically the SFT model it was initialized from)
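A common way to apply the penalty above is per token, subtracting a scaled log-probability ratio between the policy and the reference model from the reward. A minimal sketch, assuming a single-sample KL estimate and a hypothetical coefficient `beta` (values chosen for illustration):

```python
import math

def shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                  beta: float = 0.1) -> float:
    # Per-token KL penalty estimate: log pi(a|s) - log pi_ref(a|s).
    # Subtracting beta * KL discourages the policy from drifting away
    # from the reference model while it chases reward.
    return reward - beta * (logp_policy - logp_ref)

# The policy assigns a token twice the reference probability, so the
# penalty reduces the effective reward passed to PPO.
r = shaped_reward(1.0, math.log(0.5), math.log(0.25), beta=0.1)
print(r)
```

Larger `beta` keeps the policy closer to the reference at the cost of slower reward improvement; smaller `beta` allows more drift and risks degenerate, reward-hacked outputs.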