SFT: Supervised Fine-tuning—training models to predict the next token given the ground-truth history (teacher forcing)
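A minimal sketch of the SFT objective, assuming per-step model probabilities for the reference tokens are given; `sft_loss` and the numbers below are illustrative, not the paper's implementation:

```python
import math

# SFT with teacher forcing: the model is always conditioned on the
# ground-truth prefix, never on its own samples. `step_probs` holds the
# hypothetical model probabilities p(y_t | y_<t, x) for each reference token.
def sft_loss(step_probs):
    # Mean negative log-likelihood of the reference sequence.
    return -sum(math.log(p) for p in step_probs) / len(step_probs)

# A 3-token reference sentence whose tokens the model assigns
# probabilities 0.9, 0.5, 0.8 given the ground-truth history.
loss = sft_loss([0.9, 0.5, 0.8])
```

Because the loss conditions on the ground-truth history at every step, training never exposes the model to its own mistakes—exactly the setup that gives rise to exposure bias.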
Exposure Bias: The discrepancy where models are trained on ground truth history but must generate based on their own potentially erroneous predictions during inference
On-policy: Learning algorithms that optimize the model based on data generated by the current version of the model itself
Off-policy: Learning algorithms that optimize the model using a static dataset collected from a different policy (e.g., a previous version or humans)
NES: Natural Evolution Strategy—a class of optimization algorithms that updates parameters by estimating gradients from random perturbations (mutations) of the parameters and their fitness scores
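A hedged sketch of one NES update on a toy problem; `nes_step`, the fitness function, and all hyperparameters are illustrative assumptions, not the paper's setup:

```python
import random

# NES step: sample Gaussian perturbations eps ~ N(0, I), score the perturbed
# parameters with a black-box fitness function, and move theta along the
# fitness-weighted average of the perturbations (a REINFORCE-style estimate
# of the gradient of the Gaussian-smoothed fitness).
def nes_step(theta, fitness, sigma=0.1, pop=50, lr=0.05, rng=random.Random(0)):
    grad = [0.0] * len(theta)
    for _ in range(pop):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        f = fitness([t + sigma * e for t, e in zip(theta, eps)])
        # Accumulate f * eps / sigma, averaged over the population.
        for i, e in enumerate(eps):
            grad[i] += f * e / (sigma * pop)
    return [t + lr * g for t, g in zip(theta, grad)]

# Maximize the toy fitness -(x - 3)^2: theta drifts toward 3 using only
# fitness evaluations, never an analytic gradient.
theta = [0.0]
for _ in range(200):
    theta = nes_step(theta, lambda v: -(v[0] - 3.0) ** 2)
```

Note that only fitness *values* are needed, which is why NES applies even when the objective (e.g. a non-differentiable reward) provides no backpropagatable gradient.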
Perturbation signal: In this paper, the gradient of the log-probability of a generated sentence, used as a 'mutation' direction in parameter space
PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to be small to ensure stability
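A sketch of PPO's clipped surrogate objective for a single action, assuming `ratio` is pi_new(a|s) / pi_old(a|s) and `adv` is an advantage estimate; the function name and example values are illustrative:

```python
# PPO clipped surrogate: take the pessimistic minimum of the unclipped and
# clipped terms, which removes any incentive to push the probability ratio
# outside [1 - eps, 1 + eps] — this is how PPO keeps policy updates small.
def ppo_clip_objective(ratio, adv, eps=0.2):
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)

# A large ratio with positive advantage earns no credit beyond 1 + eps:
ppo_clip_objective(1.5, 2.0)  # capped at 1.2 * 2.0
```

In practice this per-action objective is averaged over a batch of on-policy rollouts and maximized by gradient ascent.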
DPO: Direct Preference Optimization—an off-policy method that optimizes the policy on preference data directly, without training an explicit reward model
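A hedged sketch of the DPO loss for one preference pair (winner y_w over loser y_l), assuming sequence log-probabilities under the policy and a frozen reference model are available; `dpo_loss` and `beta` follow the usual formulation but the names are illustrative:

```python
import math

# DPO loss: -log sigmoid(beta * margin), where the margin is how much more
# the policy has increased the winner's log-probability than the loser's,
# both measured relative to the frozen reference model.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically, -log(sigmoid(m)) = log(1 + exp(-m)).
    return math.log(1.0 + math.exp(-margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; widening the winner's margin drives the loss toward zero, all without sampling from the current policy (hence off-policy).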
RRHF: Rank Responses to Align Human Feedback—a method that aligns models by ranking candidate responses and optimizing their probabilities accordingly
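A sketch of RRHF's ranking term under the common formulation: for every pair of candidate responses where the reward model prefers one over the other, the model is penalized if the worse candidate's (length-normalized) log-probability exceeds the better one's. `cands` as a list of `(avg_logp, reward)` tuples is an illustrative interface:

```python
# RRHF ranking loss over a candidate set: sum of hinge penalties
# max(0, lp_worse - lp_better) across all misordered pairs, where lp is a
# length-normalized sequence log-probability and rewards define the ranking.
def rrhf_rank_loss(cands):
    loss = 0.0
    for lp_i, r_i in cands:
        for lp_j, r_j in cands:
            if r_i < r_j:
                # Penalize only when the lower-reward response is scored
                # at least as high as the higher-reward one.
                loss += max(0.0, lp_i - lp_j)
    return loss
```

The loss is zero whenever the model's likelihoods already agree with the reward ranking; RRHF typically adds an SFT-style term on the best candidate alongside this ranking term.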