RLHF: Reinforcement Learning from Human Feedback—a technique for aligning AI models with human values using preference data
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm that optimizes policies using clipped updates to ensure stability
DPO: Direct Preference Optimization—an algorithm that optimizes language models to satisfy preferences directly without training a separate reward model
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
MLE: Maximum Likelihood Estimation—a method for estimating the parameters of a probability distribution by maximizing a likelihood function
Bandit: A simplified reinforcement learning setting (here, a contextual bandit) where the agent makes a single decision (e.g., generating an entire sentence) and receives a single reward, with no state transitions
SFT: Supervised Fine-Tuning—the initial phase of training where the model learns to mimic high-quality demonstrations
KL divergence: Kullback-Leibler divergence—a statistical distance measure used to prevent the aligned model from drifting too far from the reference model
RTO: Reinforced Token Optimization—the proposed algorithm that uses DPO-derived token rewards to guide PPO training
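Since several of these entries (DPO, KL divergence, RTO) revolve around the DPO objective, a minimal sketch of the standard DPO loss for a single preference pair may be useful. The function name and the default beta value are illustrative assumptions, not taken from the source:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * margin), where the margin compares the
    policy-vs-reference log-ratio of the chosen response against
    that of the rejected response. beta=0.1 is an assumed default."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Numerically this is the negative log-sigmoid of the margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree exactly, the margin is zero and the
# loss reduces to log(2); raising the chosen response's log-probability
# relative to the reference drives the loss below log(2).
```

The implicit reward terms in the margin (the policy-vs-reference log-ratios) are what RTO-style methods reinterpret at the token level.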