RLHF: Reinforcement Learning from Human Feedback—the standard method for training LLMs to follow instructions using human preference data
Alignment Trilemma: The proven impossibility of simultaneously achieving Representativeness, Tractability, and Robustness in AI alignment
WEIRD: Western, Educated, Industrialized, Rich, Democratic—the demographic skew of most current AI annotators
KL divergence: A mathematical measure of how one probability distribution differs from another; used in RLHF as a penalty that keeps the fine-tuned model from drifting too far from the initial (reference) model
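A minimal sketch of the discrete KL divergence as used for this penalty; the toy distributions and the helper name `kl_divergence` are illustrative, not from any particular RLHF library:

```python
import math

def kl_divergence(p, q):
    # KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions.
    # Terms with p_i == 0 contribute nothing, by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions: the fine-tuned policy vs. the reference model.
policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]

penalty = kl_divergence(policy, reference)
# The penalty is zero only when the two distributions match exactly,
# and grows as the policy drifts away from the reference model.
```

In RLHF the training objective subtracts a term proportional to this divergence, so reward maximization is traded off against staying close to the reference model.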
Sycophancy: A failure mode where an AI model agrees with a user's incorrect beliefs or biases to maximize the predicted reward
Mode collapse: When a generative model loses diversity and produces only a limited range of outputs (e.g., always giving the safest, most generic answer)
Polynomial tractability: The ability to solve a problem using resources (time/data) that grow reasonably (polynomially) with the problem size, rather than explosively (exponentially)
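A quick numeric illustration of why the polynomial/exponential distinction matters; the cost functions here are hypothetical stand-ins for resource counts:

```python
def polynomial_cost(n, degree=2):
    # Resources grow as n**degree: large but manageable as n increases.
    return n ** degree

def exponential_cost(n):
    # Resources double with each unit increase in n: quickly infeasible.
    return 2 ** n
```

At n = 50, the quadratic cost is 2,500 units while the exponential cost already exceeds 10**15, which is the gap the tractability condition rules out.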
epsilon-representativeness: A formal condition requiring the model's reward estimate to be within ε (epsilon) of the true value function for all individuals in a population
delta-robustness: A formal condition requiring the model to maintain acceptable performance with probability at least 1-δ (delta) under worst-case perturbations
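Under notation consistent with the two entries above, the conditions can be sketched as follows; the symbols (learned reward $\hat{R}$, individual value functions $V_i$, perturbation set $\mathcal{P}$, performance threshold $\tau$) are assumed here for illustration, not taken from the source:

```latex
% epsilon-representativeness: the learned reward tracks every
% individual's true value function to within epsilon.
\forall i \in \mathrm{Population},\; \forall x:\quad
  \bigl| \hat{R}(x) - V_i(x) \bigr| \le \varepsilon

% delta-robustness: acceptable performance holds with probability
% at least 1 - delta even under worst-case perturbations p.
\Pr\Bigl[\, \mathrm{performance}(p) \ge \tau \;\; \forall p \in \mathcal{P} \,\Bigr] \ge 1 - \delta
```

Read together with polynomial tractability, these are the three properties the trilemma says cannot all hold at once.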