Language Model Personalization via Reward Factorization

📝 Paper Summary

Conversational personalization

PReF personalizes language models by decomposing rewards into shared base functions and user-specific weights, enabling rapid adaptation to new users via uncertainty-based active learning.

Core Problem

Standard RLHF learns a single universal preference model that ignores individual variations, while training separate models for each user requires prohibitive amounts of data and compute.

Why it matters:

User preferences vary drastically (e.g., professional assistant vs. virtual friend), making 'average' alignment suboptimal for everyone
Existing personalization methods require thousands of user-specific data points, which is infeasible for scaling to millions of users
Naively maintaining separate LLMs for every user creates unsustainable computational and storage costs

Concrete Example: One user might prefer concise, professional answers for work, while another wants empathetic, verbose responses for companionship. A standard RLHF model averages these into a generic tone that satisfies neither. PReF adapts to the specific user using just ~10 pairwise comparisons.

Key Novelty

Personalization via Reward Factorization (PReF)

Hypothesizes that user rewards lie in a low-dimensional linear subspace, so each user's reward is a linear combination of a small set of learned 'base' reward functions
Initializes these base functions via Singular Value Decomposition (SVD) of preference matrices to cope with sparse data and with the sensitivity of the non-convex objective to initialization
Uses active learning (logistic bandits) to efficiently infer a new user's specific combination weights by selecting query pairs that maximize uncertainty

Architecture

Conceptual flow of the PReF framework: Offline Learning -> Online Adaptation -> Inference.

Evaluation Highlights

Achieves a 67% win rate against default GPT-4o responses in human evaluations after alignment
Surpasses the performance of a standard (non-personalized) reward model using only 5 feedback samples from a new user (synthetic experiments)
Infers robust user-specific reward coefficients using only 10-20 active learning questions

Breakthrough Assessment

7/10

Clever application of matrix factorization and active learning to RLHF. Significantly reduces the data barrier for personalization, though reliance on inference-time alignment (vs training) limits scope.

⚙️ Technical Details

Problem Definition

Setting: Personalized preference learning where each user i has a unique reward function r_i derived from pairwise comparisons

Inputs: Prompt x, candidate response pair (y1, y2), user identity i

Outputs: Predicted preference probability P(y1 > y2 | x, i)
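
The inputs and outputs above can be made concrete with a small sketch. It assumes the factorized Bradley-Terry form given under Objective Functions, P(y1 > y2 | x, i) = σ(λ_i^T (ϕ(x,y1) - ϕ(x,y2))). The function name, the two hypothetical base rewards, and the toy feature values are illustrative and do not come from the paper.

```python
import math

def preference_probability(lam, phi_y1, phi_y2):
    """P(y1 > y2 | x, i) under a Bradley-Terry model with reward r_i = lam . phi."""
    # Reward gap between the two responses for this user.
    diff = sum(l * (a - b) for l, a, b in zip(lam, phi_y1, phi_y2))
    # The logistic link turns the reward gap into a probability.
    return 1.0 / (1.0 + math.exp(-diff))

# Toy example with J = 2 hypothetical base rewards ("concise", "empathetic").
lam_work = [2.0, 0.0]   # a user who only cares about the first base reward
phi_a = [0.9, 0.1]      # assumed features of response A
phi_b = [0.2, 0.8]      # assumed features of response B
p = preference_probability(lam_work, phi_a, phi_b)
```

Swapping the two responses gives the complementary probability, and a user with all-zero weights is indifferent (probability 0.5).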

Pipeline Flow

Base Function Learning (Offline): Learn shared features ϕ(x,y) from multi-user data
User Adaptation (Online): Interactive loop to infer user weights λ_i
Aligned Generation (Inference): Generate responses maximizing weighted reward

System Modules

Base Reward Network

Maps prompt-response pairs to a J-dimensional vector of base reward scores

Model or implementation: Neural network (architecture not specified)

Active Query Selector

Selects the next response pair to show the user to maximize information gain about their preferences

Model or implementation: Uncertainty sampling via Logistic Bandit theory
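
A minimal sketch of uncertainty-based query selection follows. The summary only states that the uncertainty metric is derived from the inverse Hessian of the log-likelihood, so the specific score used here, d^T H^{-1} d with d = ϕ(x,y1) - ϕ(x,y2), is an assumption borrowed from standard logistic bandit analyses. The function names and the `reg` parameter are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def select_query(candidates, asked, lam_hat, reg=1.0):
    """Return (index, scores), where index is the candidate feature difference
    with the largest uncertainty score d^T H^{-1} d.

    H is the regularized Hessian of the negative log-likelihood, evaluated at
    the current weight estimate lam_hat over the pairs already asked.
    """
    dim = candidates.shape[1]
    H = reg * np.eye(dim)
    for d in asked:
        p = sigmoid(lam_hat @ d)
        H += p * (1.0 - p) * np.outer(d, d)
    H_inv = np.linalg.inv(H)
    # Quadratic form d^T H^{-1} d for every candidate row at once.
    scores = np.einsum("ij,jk,ik->i", candidates, H_inv, candidates)
    return int(np.argmax(scores)), scores

# Toy run: the user was already asked twice about direction [1, 0].
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
asked = np.array([[1.0, 0.0], [1.0, 0.0]])
idx, scores = select_query(candidates, asked, lam_hat=np.zeros(2))
```

In this toy run the selector picks the second candidate, which probes the direction that earlier questions left unexplored.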

Alignment Engine

Generates final responses aligned with the inferred user reward function

Model or implementation: Inference-time alignment (e.g., Best-of-N; the specific algorithm is not specified)
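
Since the alignment algorithm is left unspecified, the sketch below shows Best-of-N only as one possible instance. It assumes the candidate responses were already sampled from the language model and scored by the base reward network; both are hard-coded here, and the function name is hypothetical.

```python
def best_of_n(candidates, features, lam):
    """Pick the candidate whose weighted reward lam . phi(x, y) is highest."""
    def reward(phi):
        return sum(l * f for l, f in zip(lam, phi))
    best_index = max(range(len(candidates)), key=lambda k: reward(features[k]))
    return candidates[best_index]

# Two assumed candidates with assumed base reward scores (J = 2).
responses = ["short answer", "warm, detailed answer"]
features = [[0.9, 0.1], [0.2, 0.8]]
pick_work = best_of_n(responses, features, lam=[2.0, 0.0])
pick_friend = best_of_n(responses, features, lam=[0.0, 2.0])
```

The same candidate pool yields different selections for different user weight vectors, which is the point of keeping ϕ fixed and varying only λ.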

Novel Architectural Elements

Bilinear reward head architecture: Output is dot product of user embedding λ and feature vector ϕ (instead of scalar output)
SVD-based initialization scheme for neural reward network weights
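
The idea behind an SVD-based initialization can be illustrated on a toy problem. This sketch assumes a fully observed matrix of preference logits (users by response pairs) that is exactly low rank; how the paper handles sparse or noisy observations, and how the factors are transferred into network weights, is not reproduced here. All names are hypothetical.

```python
import numpy as np

def svd_factorize(logits, rank):
    """Rank-`rank` factorization: logits ~= user_weights @ pair_features.T."""
    U, S, Vt = np.linalg.svd(logits, full_matrices=False)
    root = np.sqrt(S[:rank])
    user_weights = U[:, :rank] * root     # shape (num_users, rank)
    pair_features = Vt[:rank].T * root    # shape (num_pairs, rank)
    return user_weights, pair_features

# Synthetic data: 6 users, 8 response pairs, J = 2 base rewards.
rng = np.random.default_rng(0)
true_lam = rng.normal(size=(6, 2))
true_diff = rng.normal(size=(8, 2))
logits = true_lam @ true_diff.T
lam0, diff0 = svd_factorize(logits, rank=2)
```

The recovered factors reproduce the logit matrix but match the true ones only up to an invertible linear transform, which is one reason to treat them as a starting point for further training rather than a final answer.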

Modeling

Base Model: Neural network for reward modeling (exact size/type not reported); GPT-4o for generation evaluation

Training Method: Two-stage Reward Learning: (1) SVD Initialization, (2) Regularized MLE

Objective Functions:

Purpose: Optimize reward model to match observed binary preferences.

Formally: Regularized MLE maximizing log σ(λ_i^T (ϕ(x,y1) - ϕ(x,y2))) - β||λ_i||^2

Purpose: Select most informative questions for new users.

Formally: Maximize uncertainty metric derived from the inverse Hessian of the log-likelihood
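
The first objective above (regularized MLE for the user weights, with ϕ held fixed) can be sketched with plain gradient ascent. The optimizer, step size, iteration count, and the noise-free synthetic user are all assumptions made for illustration; the summary does not say how the paper solves this problem.

```python
import numpy as np

def fit_user_weights(diffs, labels, beta=0.1, lr=0.05, steps=500):
    """Gradient ascent on sum_t log-likelihood_t - beta * ||lam||^2.

    diffs[t]  = phi(x, y1) - phi(x, y2) for comparison t
    labels[t] = 1.0 if the user preferred y1, else 0.0
    """
    lam = np.zeros(diffs.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diffs @ lam)))
        # Gradient of the Bernoulli log-likelihood minus the L2 penalty.
        grad = diffs.T @ (labels - p) - 2.0 * beta * lam
        lam += lr * grad
    return lam

# Synthetic user with hidden weights [2, -1] answering 40 comparisons.
rng = np.random.default_rng(0)
true_lam = np.array([2.0, -1.0])
diffs = rng.normal(size=(40, 2))
labels = (diffs @ true_lam > 0).astype(float)
lam_hat = fit_user_weights(diffs, labels)
cosine = lam_hat @ true_lam / (np.linalg.norm(lam_hat) * np.linalg.norm(true_lam))
```

Because the penalty shrinks the estimate, the direction of `lam_hat` (measured by `cosine`) is a more meaningful check than its magnitude.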

Adaptation: Inference-time adaptation (estimates user vector λ_i, keeps model ϕ fixed)

Training Data:

Dataset of prompts annotated by multiple users with different preferences
Represented as matrices A (observed preferences) and P (preference probabilities)

Key Hyperparameters:

user_samples_for_inference: 10-20 questions

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard RLHF: PReF learns user-specific weights rather than a global average
vs. Multi-Objective RLHF: PReF learns the base functions from data rather than using pre-defined heuristics
vs. Poddar et al.: PReF uses linear factorization and active learning for explicit weight inference
vs. PMF (Probabilistic Matrix Factorization) [not cited in paper]: PReF applies factorization to the reward function of an LLM within a pairwise choice framework, rather than static rating matrices

Limitations

Optimization landscape for bilinear reward model is non-convex and sensitive to initialization
Requires active participation (10-20 questions) from every new user before full personalization
Relies on inference-time alignment, which may be higher latency than fine-tuning
Assumes user preferences are static and strictly follow the Bradley-Terry linear model

Reproducibility

Code: https://idanshen.github.io/PReF/

Code and a demo are available at the project website. The paper relies on GPT-4o, a closed-source model, for evaluation. The specific architecture of the reward network is not specified.

📊 Experiments & Results

Evaluation Setup

Personalized response generation validated by human users and synthetic user simulation

Benchmarks:

Synthetic User Simulation (Preference prediction) [New]
Real User Evaluation (Human preference ranking) [New]

Metrics:

Win Rate (vs Default GPT-4o)
Reward Model Accuracy (vs Standard RLHF)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Human Eval	Win Rate (%)	50.0	67.0	+17.0
Synthetic Simulation	Samples to beat baseline	Not applicable	5	Not applicable

Main Takeaways

Personalization improves user satisfaction: personalized responses reach a 67% win rate against default GPT-4o responses in human evaluation.
The 'base reward' assumption holds sufficiently well to allow rapid adaptation with very few samples (5-20).
Active learning (uncertainty sampling) is crucial for minimizing the user burden during the personalization phase.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Matrix Factorization (SVD)
Active Learning / Bandits
Bradley-Terry Model

Key Terms

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models to maximize a reward function learned from human preferences

Reward Factorization: Decomposing a user's reward into a set of shared 'base' features and user-specific weights

Bradley-Terry Model: A statistical model that predicts the probability of preferring one item over another based on their score difference

SVD: Singular Value Decomposition—a linear algebra method used here to initialize the reward components from sparse data

Logistic Bandits: An online learning framework where an agent selects actions (questions) to maximize information gain about a logistic reward model

Inference-time alignment: Techniques to steer model generation towards high-reward outputs during decoding without updating model weights (e.g., Best-of-N)