PPO-Clip: Proximal Policy Optimization with Clipping—an RL algorithm that limits policy updates to a trusted region by clipping the probability ratio, ensuring stability.
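The clipped surrogate can be sketched in a few lines. This is an illustrative per-sample version (the function name and `eps` default are ours, not from the source); taking the minimum with the clipped term removes any incentive to push the probability ratio outside [1 − ε, 1 + ε]:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip surrogate objective (illustrative sketch).

    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated advantage A(s, a).
    The min with the clipped ratio caps how much a single update can
    change the policy, which is the source of PPO's stability.
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage and a ratio of 1.5, the clip caps the effective ratio at 1.2, so further increasing the ratio yields no extra objective value.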
f-divergence: A general family of divergence measures between probability distributions (includes KL divergence, Chi-squared, etc.) used here to regularize the policy.
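For discrete distributions, every f-divergence takes the form D_f(P || Q) = Σ_x q(x) f(p(x)/q(x)) for a convex f with f(1) = 0. A minimal sketch (function names are ours; it assumes q(x) > 0 wherever p(x) > 0):

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) over a discrete support.

    Assumes q(x) > 0 wherever p(x) > 0; different choices of the convex
    generator f recover different members of the family.
    """
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q) if qi > 0)

kl_gen = lambda t: t * math.log(t) if t > 0 else 0.0  # recovers D_KL(P || Q)
chi2_gen = lambda t: (t - 1.0) ** 2                   # recovers chi-squared
```

Plugging in `kl_gen` gives KL divergence; `chi2_gen` gives chi-squared, illustrating how one regularizer family covers both.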
Łojasiewicz inequality: A mathematical condition relating a function's value gap to its gradient norm, often used to prove faster (linear) convergence rates for non-convex problems.
Softmax policy: A policy parameterization where action probabilities are proportional to the exponential of learned parameters (logits).
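A softmax policy maps a logit vector to action probabilities π(a) ∝ exp(z_a). A minimal sketch (the max-shift is the standard trick for numerical stability and does not change the result):

```python
import math

def softmax_policy(logits):
    """Action probabilities proportional to exp(logits).

    Subtracting the max logit before exponentiating avoids overflow
    without changing the resulting distribution.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```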
Lipschitz smoothness: A condition where the gradient of a function cannot change arbitrarily fast; essential for bounding the progress of each optimization step (via the descent lemma).

RLHF: Reinforcement Learning from Human Feedback—a technique to align AI models with human preferences using reward models trained on human data.
Policy Drift: The phenomenon where an optimized policy deviates significantly from the initial reference policy, often leading to reward hacking.
Forward KL: Kullback-Leibler divergence D_KL(P || Q), with P the reference policy and Q the trained policy; it penalizes the trained policy for assigning low probability to outcomes the reference considers likely, yielding mass-covering behavior.
Reverse KL: Kullback-Leibler divergence D_KL(Q || P); the standard regularizer in RLHF, known for mode-seeking behavior.
Mode-seeking: The tendency of an optimization process (like reverse KL) to collapse a distribution onto a single high-probability mode rather than covering the full support.