RM: Reward Model—a model trained on human preference data to predict which of two pieces of text a human would prefer, typically by assigning each text a scalar score
RLHF: Reinforcement Learning from Human Feedback—a method to fine-tune language models using a reward model trained on human preference data
Bradley-Terry Model: A probability model used to predict the outcome of a paired comparison; here, it converts scalar rewards into probabilities of one answer being preferred over another
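Under the Bradley-Terry model, the probability that answer A is preferred over answer B is the softmax of their rewards, which reduces to the sigmoid of the reward difference. A minimal sketch (the function name is illustrative, not from the source):

```python
import math

def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(A preferred over B) = exp(r_A) / (exp(r_A) + exp(r_B)),
    which equals sigmoid(r_A - r_B)."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))

# Equal rewards give a 50/50 preference; a higher reward for A
# shifts the probability toward A.
p_tie = bradley_terry_prob(1.0, 1.0)
p_a_wins = bradley_terry_prob(2.0, 0.0)
```

Note that the two probabilities for a pair always sum to one, so scoring the pair in either order is consistent.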
Jensen-Shannon Distance: A symmetric, bounded metric quantifying how different two probability distributions are (used here for non-ordinal opinions)
Wasserstein Distance: A distance function between probability distributions defined on a metric space (used here for ordinal opinions, treating them as points on a line)
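Both distances can be computed directly for distributions over a finite set of answer choices. A minimal sketch, assuming base-2 logs for the JS distance (so it lies in [0, 1]) and unit-spaced ordinal categories for the Wasserstein distance:

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance: the square root of the JS divergence,
    computed against the midpoint distribution m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i = 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def wasserstein_1d(p, q):
    """1-Wasserstein distance for ordinal categories treated as
    unit-spaced points on a line: the sum of |CDF differences|."""
    cdf_p = cdf_q = total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total
```

The Wasserstein distance is the natural choice for ordinal answer scales (e.g. "strongly disagree" to "strongly agree") because moving probability mass between adjacent options costs less than moving it between distant ones, whereas the JS distance treats all categories as interchangeable.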
Opinion Distribution: A probability distribution over possible answers to a question, derived by applying a softmax function to the RM's reward scores for each answer
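Deriving an opinion distribution from reward scores is a single softmax over the RM's score for each answer choice. A minimal sketch (the reward values below are hypothetical):

```python
import math

def opinion_distribution(rewards):
    """Convert per-answer RM reward scores into a probability
    distribution over the answer choices via a softmax."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp(r - m) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical RM rewards for the three answer choices of one question.
probs = opinion_distribution([2.1, 0.3, -1.0])
```

The result sums to one and preserves the ordering of the rewards, so the highest-scored answer always receives the largest probability.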
Steerability: The ability to change a model's behavior or opinions by providing context or instructions in the prompt (e.g., 'Answer as a liberal')
Friedman test: A non-parametric statistical test for detecting differences among several treatments measured on the same subjects (here, differences in alignment ranks between demographic groups across questions)
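The Friedman statistic ranks the treatments within each block and compares the resulting rank sums. A minimal sketch in plain Python, assuming blocks are questions and treatments are demographic groups; it uses average ranks for ties and omits the tie-correction term that full implementations apply:

```python
def friedman_statistic(blocks):
    """Friedman chi-squared statistic.
    `blocks` is a list of rows, one per block (question); each row holds
    one measurement per treatment (e.g. one alignment score per
    demographic group). Returns Q = 12/(n*k*(k+1)) * sum(R_j^2) - 3n(k+1),
    where R_j is the rank sum of treatment j over the n blocks."""
    n = len(blocks)     # number of blocks (questions)
    k = len(blocks[0])  # number of treatments (demographic groups)
    rank_sums = [0.0] * k
    for row in blocks:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            # Extend j over a run of tied values.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for the tied run
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    return (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
```

Under the null hypothesis of no difference between groups, Q is approximately chi-squared distributed with k-1 degrees of freedom; a large Q indicates that some groups are consistently ranked as better aligned than others.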