RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards learned from human preferences
Safe RLHF: A variant of RLHF that decouples helpfulness and harmlessness, optimizing reward subject to a cost constraint
FSD: First-Order Stochastic Dominance—a condition under which one cost distribution is uniformly 'better' than another: at every threshold, it assigns at least as much probability to costs below that threshold (equivalently, a lower probability of exceeding it)
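As a sketch, FSD between two cost samples can be checked empirically by comparing their empirical CDFs at every observed value (the function name `fsd_dominates` and the toy samples are illustrative, not from the source):

```python
import numpy as np

def fsd_dominates(costs_a, costs_b):
    """Empirical check: does cost sample A first-order stochastically
    dominate B (lower costs are better)? Requires F_A(x) >= F_B(x)
    at every threshold x, i.e. A puts at least as much mass below x."""
    grid = np.union1d(costs_a, costs_b)
    cdf_a = np.searchsorted(np.sort(costs_a), grid, side="right") / len(costs_a)
    cdf_b = np.searchsorted(np.sort(costs_b), grid, side="right") / len(costs_b)
    return bool(np.all(cdf_a >= cdf_b))

# A is B shifted down by 1 unit of cost, so A dominates B but not vice versa.
b = np.array([1.0, 2.0, 3.0, 4.0])
a = b - 1.0
print(fsd_dominates(a, b))  # True
print(fsd_dominates(b, a))  # False
```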
CVaR: Conditional Value at Risk—a risk measure quantifying the expected loss in the worst α-fraction of cases (tail risk)
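A minimal sample-based estimate of CVaR simply averages the worst α-fraction of observed costs (the helper name `cvar` and the sample values are illustrative):

```python
import numpy as np

def cvar(costs, alpha=0.1):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of costs."""
    costs = np.sort(np.asarray(costs))
    k = max(1, int(np.ceil(alpha * len(costs))))  # number of tail samples
    return costs[-k:].mean()

samples = np.array([1.0, 2.0, 3.0, 10.0, 100.0])
print(cvar(samples, alpha=0.4))  # mean of the worst 2 samples: 55.0
```

Note that the mean of all samples here is 23.2, while CVaR at α = 0.4 is 55.0: CVaR is sensitive to the tail in a way the plain expectation is not.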
Optimal Transport: A framework for measuring distances between probability distributions by calculating the cheapest way to move mass from one to the other
Sinkhorn iterations: An algorithm to efficiently solve entropically regularized optimal transport problems, making them differentiable
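A compact sketch of Sinkhorn iterations for entropically regularized OT between two discrete histograms (variable names, the regularization strength `eps`, and the toy cost matrix are illustrative assumptions):

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.05, n_iters=500):
    """Entropically regularized optimal transport via Sinkhorn iterations.
    a, b: source/target histograms (sum to 1); cost: pairwise cost matrix."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale to match column marginals
        u = a / (K @ v)                  # rescale to match row marginals
    P = u[:, None] * K * v[None, :]      # transport plan
    return P, float((P * cost).sum())    # plan and its transport cost

# Identical histograms with zero cost on the diagonal:
# the optimal plan keeps mass in place, so the cost is near zero.
a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
P, total = sinkhorn(a, b, cost)
```

Because every step is a differentiable elementwise or matrix operation, the resulting cost can be backpropagated through, which is what makes this useful inside a learned objective.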
Spectral Risk Measures: A class of risk measures that weight different quantiles of the cost distribution (e.g., prioritizing the tail) to define a total risk score
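In the discrete case, a spectral risk measure is just a weighted average of the sorted costs, where the weights are nondecreasing (so higher quantiles count more) and sum to 1. A sketch with illustrative weight choices:

```python
import numpy as np

def spectral_risk(costs, weights):
    """Discrete spectral risk: weighted average of sorted costs.
    weights must be nondecreasing (tail-heavy) and sum to 1."""
    costs = np.sort(np.asarray(costs, dtype=float))
    w = np.asarray(weights, dtype=float)
    assert np.all(np.diff(w) >= 0) and np.isclose(w.sum(), 1.0)
    return float(costs @ w)

c = [3.0, 1.0, 2.0, 4.0]
print(spectral_risk(c, [0.25, 0.25, 0.25, 0.25]))  # 2.5: uniform weights = mean
print(spectral_risk(c, [0.0, 0.0, 0.5, 0.5]))      # 3.5: tail-only = CVaR at 0.5
```

Both the expectation and CVaR are recovered as special cases of the weight (spectrum) function, which is why this class subsumes the risk measures above.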
Quantile Function: The inverse of the Cumulative Distribution Function; maps a probability p to the value below which a fraction p of the data falls
Dual Ascent: An optimization method that alternates between updating the primal variables (policy parameters) and the dual variables (Lagrange multipliers) to satisfy constraints
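A minimal sketch of dual ascent on a toy constrained problem (the problem, step sizes, and variable names are illustrative, not the method from the source): minimize f(x) = x² subject to g(x) = 1 − x ≤ 0, alternating a primal gradient step on the Lagrangian with a projected ascent step on the multiplier.

```python
# Lagrangian: L(x, lam) = x**2 + lam * (1 - x).
# Primal: gradient descent on L in x. Dual: gradient ascent on L in lam,
# projected onto lam >= 0. The KKT point is x* = 1, lam* = 2 (since 2x* = lam*).
x, lam = 0.0, 0.0
eta_x, eta_lam = 0.1, 0.1
for _ in range(2000):
    x -= eta_x * (2 * x - lam)                 # primal descent step
    lam = max(0.0, lam + eta_lam * (1 - x))    # dual ascent step, lam >= 0
print(round(x, 3), round(lam, 3))  # 1.0 2.0
```

In Safe RLHF-style training, x plays the role of the policy parameters, the constraint encodes the cost budget, and the multiplier automatically scales how hard the cost term is penalized.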