Idan Shenfeld, Felix Faltings, Pulkit Agrawal, Aldo Pacchiano
Boston University
arXiv.org
(2025)
P13NRL
📝 Paper Summary
Conversational personalization
PReF personalizes language models by decomposing rewards into shared base functions and user-specific weights, enabling rapid adaptation to new users via uncertainty-based active learning.
Core Problem
Standard RLHF learns a single universal preference model that ignores individual variations, while training separate models for each user requires prohibitive amounts of data and compute.
Why it matters:
User preferences vary drastically (e.g., professional assistant vs. virtual friend), making 'average' alignment suboptimal for everyone
Existing personalization methods require thousands of user-specific data points, which is infeasible for scaling to millions of users
Naively maintaining separate LLMs for every user creates unsustainable computational and storage costs
Concrete Example: One user might prefer concise, professional answers for work, while another wants empathetic, verbose responses for companionship. A standard RLHF model averages these into a generic tone that satisfies neither. PReF adapts to the specific user using just ~10 pairwise comparisons.
Key Novelty
Personalization via Reward Factorization (PReF)
Hypothesizes that user rewards lie on a low-dimensional manifold, representable as a linear combination of learned 'base' reward functions
Initializes these base functions via Singular Value Decomposition (SVD) of preference matrices to handle data sparsity and non-convex optimization
Uses active learning (logistic bandits) to efficiently infer a new user's specific combination weights by selecting query pairs that maximize uncertainty
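The low-rank idea behind the SVD initialization can be illustrated with a toy example. This is a minimal sketch, assuming a fully observed, noise-free users-by-pairs matrix of preference logits; the paper's setting involves sparse preference data, which this toy does not model. The function `svd_init` and all variable names are hypothetical.

```python
import numpy as np

def svd_init(logits, rank):
    """Split a (users x pairs) logit matrix into user weights and pair features."""
    u, s, vt = np.linalg.svd(logits, full_matrices=False)
    root = np.sqrt(s[:rank])                 # share singular values between both factors
    user_weights = u[:, :rank] * root        # shape (users, rank)
    pair_features = vt[:rank, :].T * root    # shape (pairs, rank)
    return user_weights, pair_features

# Toy data (assumption): 30 users, 200 response pairs, 3 base reward functions.
rng = np.random.default_rng(0)
true_weights = rng.normal(size=(30, 3))      # stands in for the user weights
true_diffs = rng.normal(size=(200, 3))       # stands in for feature gaps between two responses
logits = true_weights @ true_diffs.T         # exactly rank-3 by construction

weights, features = svd_init(logits, rank=3)
recon_error = np.abs(weights @ features.T - logits).max()
```

The factors are only identified up to an invertible linear transform, so the meaningful check is that their product reconstructs the logit matrix, not that `weights` equals `true_weights`.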
Architecture
Conceptual flow of the PReF framework: Offline Learning -> Online Adaptation -> Inference.
Evaluation Highlights
Achieves a 67% win rate against default GPT-4o responses in human evaluations after alignment
Surpasses the performance of a standard (non-personalized) reward model using only 5 feedback samples from a new user (synthetic experiments)
Infers robust user-specific reward coefficients using only 10-20 active learning questions
Breakthrough Assessment
7/10
Clever application of matrix factorization and active learning to RLHF. Significantly reduces the data barrier for personalization, though reliance on inference-time alignment (vs training) limits scope.
⚙️ Technical Details
Problem Definition
Setting: Personalized preference learning where each user i has a unique reward function r_i derived from pairwise comparisons
Inputs: Prompt x, candidate response pair (y1, y2), user identity i
Outputs: Predicted preference probability P(y1 > y2 | x, i)
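The output above can be made concrete under the Bradley-Terry model with a linear reward, where the preference probability is the sigmoid of the reward gap between the two responses. This is a minimal sketch; the function name and the two-dimensional feature vectors are illustrative assumptions, not values from the paper.

```python
import numpy as np

def preference_probability(user_weights, features_y1, features_y2):
    """P(y1 > y2 | x, i) under a Bradley-Terry model with a linear reward."""
    reward_gap = user_weights @ (features_y1 - features_y2)
    return 1.0 / (1.0 + np.exp(-reward_gap))

# Hypothetical user weights and base-reward features for two responses.
user_weights = np.array([1.0, -0.5])
features_y1 = np.array([0.8, 0.2])
features_y2 = np.array([0.1, 0.9])

p = preference_probability(user_weights, features_y1, features_y2)  # about 0.74
p_reversed = preference_probability(user_weights, features_y2, features_y1)
```

Swapping the two responses flips the sign of the reward gap, so `p` and `p_reversed` sum to 1.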
Pipeline Flow
Base Function Learning (Offline): Learn shared features ϕ(x,y) from multi-user data
User Adaptation (Online): Interactive loop to infer user weights λ_i
Versus PMF (Probabilistic Matrix Factorization; not cited in the paper): PReF factorizes the reward function of an LLM within a pairwise-choice framework, whereas PMF factorizes static rating matrices
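The online adaptation step in the pipeline can be sketched as a loop that repeatedly asks the most uncertain comparison and refits the user's weights. This is a simplified illustration, not the paper's exact acquisition rule: it assumes the base features are already learned, scores uncertainty with a regularized design matrix, refits with a fixed number of Newton steps, and simulates the user's answers. All names and numbers are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weights(diffs, labels, reg=1.0, steps=25):
    """L2-regularized logistic regression, fitted with a fixed number of Newton steps."""
    w = np.zeros(diffs.shape[1])
    for _ in range(steps):
        p = sigmoid(diffs @ w)
        grad = diffs.T @ (p - labels) + reg * w
        hess = (diffs * (p * (1 - p))[:, None]).T @ diffs + reg * np.eye(len(w))
        w -= np.linalg.solve(hess, grad)
    return w

def select_query(candidates, asked_diffs, reg=1.0):
    """Pick the candidate pair whose feature gap is least covered by past queries."""
    dim = candidates.shape[1]
    design = reg * np.eye(dim) + asked_diffs.T @ asked_diffs
    inv = np.linalg.inv(design)
    uncertainty = np.einsum("nd,de,ne->n", candidates, inv, candidates)
    return int(np.argmax(uncertainty))

# Simulated new user (assumption): true weights are hidden from the learner.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
candidates = rng.normal(size=(500, 3))       # feature gaps of candidate response pairs
asked, labels = np.empty((0, 3)), np.empty(0)

for _ in range(20):                          # 20 questions, the upper end of the range above
    idx = select_query(candidates, asked)
    d = candidates[idx]
    answer = float(rng.random() < sigmoid(d @ true_w))  # 1.0 if the user prefers y1
    asked = np.vstack([asked, d])
    labels = np.append(labels, answer)

w_hat = fit_weights(asked, labels)           # estimated user weights
```

With no queries asked yet, the design matrix is the identity, so the first question is simply the pair with the largest feature gap; later questions favor directions that earlier answers have not yet covered.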
Limitations
Optimization landscape for bilinear reward model is non-convex and sensitive to initialization
Requires active participation (10-20 questions) from every new user before full personalization
Relies on inference-time alignment, which may add latency at generation time compared with serving a fine-tuned model
Assumes user preferences are static and strictly follow the Bradley-Terry linear model
Code and demo available at project website. Paper relies on GPT-4o for evaluation which is closed source. Specific architecture of the reward neural network is not detailed in the text provided.
📊 Experiments & Results
Evaluation Setup
Personalized response generation validated by human users and synthetic user simulation
Benchmarks:
Synthetic User Simulation (Preference prediction) [New]
Real User Evaluation (Human preference ranking) [New]
Metrics:
Win Rate (vs Default GPT-4o)
Reward Model Accuracy (vs Standard RLHF)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
Human Eval | Win Rate (%) | 50.0 | 67.0 | +17.0
Synthetic Simulation | Samples to beat baseline | Not applicable | 5 | Not applicable
Main Takeaways
Personalized responses are preferred over default GPT-4o responses in 67% of human comparisons, although no statistical test is reported for this gap.
The 'base reward' assumption holds sufficiently well to allow rapid adaptation with very few samples (5-20).
Active learning (uncertainty sampling) is crucial for minimizing the user burden during the personalization phase.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning from Human Feedback (RLHF)
Matrix Factorization (SVD)
Active Learning / Bandits
Bradley-Terry Model
Key Terms
RLHF: Reinforcement Learning from Human Feedback—fine-tuning models to maximize a reward function learned from human preferences
Reward Factorization: Decomposing a user's reward into a set of shared 'base' features and user-specific weights
Bradley-Terry Model: A statistical model that predicts the probability of preferring one item over another based on their score difference
SVD: Singular Value Decomposition—a linear algebra method used here to initialize the reward components from sparse data
Logistic Bandits: An online learning framework where an agent selects actions (questions) to maximize information gain about a logistic reward model
Inference-time alignment: Techniques to steer model generation towards high-reward outputs during decoding without updating model weights (e.g., Best-of-N)
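Inference-time alignment with Best-of-N, as named in the last term above, can be sketched as scoring each candidate response with the user's personalized reward and returning the highest-scoring one. This is a minimal illustration: the candidate texts, the two features, and their labels are hypothetical, and learned base functions need not be interpretable in this way.

```python
import numpy as np

def best_of_n(candidates, candidate_features, user_weights):
    """Return the candidate with the highest personalized reward."""
    rewards = candidate_features @ user_weights
    return candidates[int(np.argmax(rewards))]

candidates = ["Short, formal reply.", "Long, warm, chatty reply."]
# Hypothetical base-reward features per candidate: [conciseness, warmth].
candidate_features = np.array([[0.9, 0.1],
                               [0.2, 0.8]])

work_user = np.array([1.0, 0.0])        # hypothetical weights: values conciseness
companion_user = np.array([0.0, 1.0])   # hypothetical weights: values warmth

work_choice = best_of_n(candidates, candidate_features, work_user)
companion_choice = best_of_n(candidates, candidate_features, companion_user)
```

The same candidate pool yields different selections for the two users, and no model weights are updated.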