RLHF: Reinforcement Learning from Human Feedback, a technique for aligning models using human preference data
Bradley-Terry model: A statistical model that estimates the probability that one item is preferred over another as a logistic (sigmoid) function of the difference between their scores
Personalization Tax: The degradation in general capabilities (safety, reasoning, chat quality) observed when a model is optimized for specific personal preferences
Inter-personal disagreement: The extent to which different users prefer different responses for the same input
Intra-personal consistency: How reliably a single user prefers the same type of response over time or across similar contexts
P-SOUPS: A synthetic dataset in which simulated users hold opposing preferences along dimensions like expertise and style
GPO: Group Preference Optimization, a meta-learning approach that uses a transformer module to predict preferences from few-shot examples
PRM: Personalized Reward Modeling—methods that explicitly condition the reward function on user embeddings or IDs
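
To make the Bradley-Terry entry above concrete, here is a minimal sketch of the preference probability and the pairwise loss commonly derived from it in reward modeling. The function names (bradley_terry_prob, pairwise_loss) and the scalar scores are illustrative assumptions, not taken from any particular library or from the methods listed in this glossary.

```python
import math


def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Probability that item A is preferred over item B.

    Assumes scores are on a log scale, so the probability is the
    sigmoid of the score difference (hypothetical helper).
    """
    diff = score_a - score_b
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the observed preference, -log sigmoid(diff).

    Written with log1p so it stays finite when the rejected item
    scores far higher than the chosen one.
    """
    diff = score_chosen - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    # Equal scores give no preference either way.
    print(bradley_terry_prob(1.0, 1.0))  # 0.5
    # A larger score gap in favor of the chosen response lowers the loss.
    print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.5, 0.0))  # True
```

In a personalized setting, the same formula would apply with scores that also depend on the user, for example through a user embedding; that conditioning is left out of this sketch.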