RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation

📝 Paper Summary

Personalization (P13N) Reward Modeling

Proposes two frameworks for RLHF with heterogeneous users: learning personalized reward models via shared representations, and aggregating probabilistic preferences via an incentive-compatible mechanism that ensures truthful reporting.

Core Problem

Standard RLHF assumes that human preferences are homogeneous and honestly reported, but real user populations have diverse, conflicting preferences, and users may strategically misreport feedback to manipulate the model.

Why it matters:

Assuming a single reward model for diverse populations leads to misalignment and poor performance for minority groups or specific user contexts.
Individual users often provide insufficient data to train standalone reward models, creating a 'cold start' problem for personalization.
Strategic users (e.g., in online rating systems) may provide extreme feedback to disproportionately influence the aggregate model, distorting the AI's alignment.

Concrete Example: In an online rating system, a user might rate a decent response as 'terrible' (extreme feedback) not because they hate it, but to drag the overall average closer to their personal preference, manipulating the aggregated reward model.
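The manipulation in this example can be made concrete with a toy calculation. The sketch below assumes a naive mean aggregator and made-up scores on a 0-1 scale; neither the aggregator nor the numbers come from the paper, and the names (mean_aggregate, other_reports) are illustrative only.

```python
def mean_aggregate(scores):
    """Average reported scores into one consensus score (naive aggregation)."""
    return sum(scores) / len(scores)


# Hypothetical reports on a 0-1 scale from three other users.
other_reports = [0.8, 0.7, 0.9]
true_score = 0.5  # what the strategic user actually thinks of the response

# Outcome if the user reports truthfully vs. exaggerates to "terrible" (0.0).
truthful_outcome = mean_aggregate(other_reports + [true_score])
strategic_outcome = mean_aggregate(other_reports + [0.0])

# Distance between the aggregate and the user's true opinion in each case.
truthful_gap = abs(truthful_outcome - true_score)
strategic_gap = abs(strategic_outcome - true_score)

print(f"truthful report:  aggregate={truthful_outcome:.3f}, gap={truthful_gap:.3f}")
print(f"strategic report: aggregate={strategic_outcome:.3f}, gap={strategic_gap:.3f}")
```

Under these assumed numbers the exaggerated report pulls the aggregate from about 0.725 to about 0.6, closer to the user's true opinion of 0.5. Misreporting therefore pays off under plain averaging, which is the incentive problem the aggregation framework targets.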

Key Novelty

Principled Personalization and Strategic-Aware Aggregation

Personalization: Uses representation learning to find a shared structure across all users, then learns individual 'heads' (parameters) for each user or cluster, trading off bias and variance.
Strategic Aggregation: Models feedback collection as a mechanism design problem. By using 'probabilistic opinion' feedback and specific cost mechanisms, it makes truthful reporting the optimal strategy for users.

Architecture

Conceptual illustration of the challenge of heterogeneous user preferences in RLHF (inferred from text description)

Evaluation Highlights

Establishes sample complexity guarantees for personalized reward learning using shared representations across heterogeneous users.
Proves that the proposed probabilistic opinion aggregation rule maximizes social welfare functions.
Shows that the proposed feedback mechanism is dominant-strategy incentive-compatible (DSIC), so truthful preference reporting is optimal for each user regardless of what others report.

Breakthrough Assessment

7/10

Significant theoretical contribution introducing mechanism design and social choice theory to RLHF. Addresses the critical but overlooked problems of preference heterogeneity and strategic manipulation.

⚙️ Technical Details

Problem Definition

Setting: Learning reward models and policies from heterogeneous human feedback (pairwise comparisons or probabilistic opinions) under potential strategic behavior.

Inputs: Set of human users with distinct latent reward functions providing feedback (comparisons or probability vectors) on query-response pairs.

Outputs: Personalized reward models for individuals/clusters OR a single aggregated reward model/policy aligned with social welfare.

Pipeline Flow

Feedback Collection (Pairwise Comparisons OR Probabilistic Opinions)
Framework Selection (Personalization vs. Aggregation)
Personalization Path: Representation Learning -> User/Cluster Parameter Estimation
Aggregation Path: Individual Reward Estimation -> Social Welfare Aggregation -> Policy Optimization

System Modules

Feedback Interface

Collects feedback from N users; supports either standard pairwise comparisons or 'Probabilistic Opinion' vectors

Model or implementation: Not applicable (Interface)

Representation Learner (Personalization Track)

Learns a common low-dimensional representation shared across heterogeneous users to enable data pooling

Model or implementation: Feature map phi(x,y)

Cluster/User Estimator (Personalization Track)

Estimates specific weight vectors for individual users or clusters of users

Model or implementation: Linear weights w_i
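A minimal sketch of how a per-user head w_i might be fit on top of a fixed shared feature map phi(x,y). It assumes a Bradley-Terry comparison model, plain gradient descent, and synthetic features standing in for phi; the function names, step size, and data are all hypothetical and not taken from the paper.

```python
import numpy as np


def user_nll(w, feats_pref, feats_rej):
    """Mean Bradley-Terry negative log-likelihood for one user's comparisons."""
    margins = (feats_pref - feats_rej) @ w
    # log(1 + exp(-m)) computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -margins))


def fit_user_head(feats_pref, feats_rej, lr=0.5, steps=500):
    """Gradient descent on the NLL over the user-specific weights w_i."""
    diff = feats_pref - feats_rej
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        margins = diff @ w
        # d/dw log(1 + exp(-m)) = -sigmoid(-m) * diff, with sigmoid(-m) = 1 / (1 + exp(m))
        grad = -(diff * (1.0 / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w


# Synthetic stand-in for shared features phi(x, y) of two candidate responses.
rng = np.random.default_rng(0)
n, k = 400, 3
phi_a = rng.normal(size=(n, k))
phi_b = rng.normal(size=(n, k))

# Hypothetical ground-truth head for one user, used only to simulate labels.
w_true = np.array([2.0, -1.0, 0.5])
p_a_wins = 1.0 / (1.0 + np.exp(-(phi_a - phi_b) @ w_true))
a_wins = rng.random(n) < p_a_wins

feats_pref = np.where(a_wins[:, None], phi_a, phi_b)
feats_rej = np.where(a_wins[:, None], phi_b, phi_a)

w_hat = fit_user_head(feats_pref, feats_rej)
print("fitted head:", np.round(w_hat, 2))
print("NLL at zero:", user_nll(np.zeros(k), feats_pref, feats_rej))
print("NLL at fit: ", user_nll(w_hat, feats_pref, feats_rej))
```

Because every user shares the same k-dimensional features, each head has only k parameters to estimate. This is the intuition behind pooling data through the representation; the paper's actual estimator and guarantees may differ from this sketch.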

Strategic Aggregator

Aggregates individual probabilistic opinions into a single consensus opinion while maximizing social welfare

Model or implementation: Aggregation Rule (e.g., weighted average based on social choice axioms)
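As one illustration of a weighted-average rule, the sketch below implements a linear opinion pool over probabilistic opinions. The provided text does not give the paper's exact aggregation rule, so the function linear_pool, the uniform default weights, and the sample opinions are assumptions made for illustration.

```python
import numpy as np


def linear_pool(opinions, weights=None):
    """Weighted average of probability vectors (one row per user)."""
    opinions = np.asarray(opinions, dtype=float)
    if weights is None:
        # Assumption: treat all users equally when no weights are given.
        weights = np.full(opinions.shape[0], 1.0 / opinions.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the result is a distribution
    return weights @ opinions


# Three hypothetical users, each giving a distribution over three candidate responses.
opinions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
]
consensus = linear_pool(opinions)
print("consensus opinion:", consensus)
```

A convex combination of probability vectors is itself a probability vector, so the consensus stays a valid opinion. Other pooling rules (for example, geometric averaging followed by renormalization) have the same interface but different axiomatic properties.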

Novel Architectural Elements

Dual-track framework allowing either Personalization (branching reward models) or Aggregation (unified reward model)
Incentive-compatible feedback mechanism that incorporates user costs into each user's utility so that truthful reporting is optimal

Modeling

Base Model: Large Language Models (general framework)

Training Method: Theoretical Framework for Reward Learning (Representation Learning + Aggregation)

Objective Functions:

Purpose: Learn personalized parameters.

Formally: Minimize the negative log-likelihood of the preference data under the reward model r_i(x,y) = phi(x,y)^T w_i

Purpose: Aggregate preferences.

Formally: Maximize a social welfare function (sum of individual utilities)

Purpose: Ensure truthfulness.

Formally: Design utility U_i(b_i) = V_i(Agg(b)) - Cost(b_i) such that reporting the true opinion p_i is optimal
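The three objectives can be written out more explicitly as follows. This is a sketch under stated assumptions: the likelihood is taken to be Bradley-Terry over pairwise comparisons (the pairwise case of Plackett-Luce), y_j^+ and y_j^- denote the preferred and rejected responses, n_i is the number of comparisons from user i, and the welfare sum is written over the valuations V_i because the provided text does not say whether costs enter it. The last line is the standard definition of dominant-strategy incentive compatibility, with b_{-i} denoting the reports of all users other than i.

```latex
% Personalized reward learning for user i (Bradley-Terry likelihood assumed)
\hat{w}_i \in \arg\min_{w} \; -\sum_{j=1}^{n_i}
  \log \sigma\!\Big( \phi(x_j, y_j^{+})^{\top} w - \phi(x_j, y_j^{-})^{\top} w \Big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Welfare-maximizing aggregation over N users (written over valuations V_i)
\mathrm{Agg}(b) \in \arg\max_{a} \; \sum_{i=1}^{N} V_i(a)

% Truthfulness (DSIC): reporting the true opinion p_i is optimal for every user,
% whatever the others report
U_i(p_i, b_{-i}) \;\ge\; U_i(b_i, b_{-i})
\quad \text{for all } i,\ b_i,\ b_{-i},
\qquad U_i(b_i, b_{-i}) = V_i\big(\mathrm{Agg}(b_i, b_{-i})\big) - \mathrm{Cost}(b_i)
```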

Key Hyperparameters:

learning_rate: Not reported in the provided text
batch_size: Not reported in the provided text

Compute: Not reported in the provided text

Comparison to Prior Work

vs. Standard RLHF: Explicitly models heterogeneity and strategic behavior; provides personalization guarantees
vs. Chakraborty et al. (2024): Adds mechanism design for strategic truthfulness and focuses on welfare maximization rather than just egalitarianism
vs. Zhong et al. (2024): Zhong et al. focus on linear representations and Von Neumann winners; this paper covers general representations, clustering, and probabilistic opinion feedback with truthfulness mechanisms

Limitations

Theoretical guarantees rely on assumptions like the existence of a shared low-rank representation.
Requires specific feedback formats (e.g., probabilistic opinions) which may be harder for humans to provide than simple comparisons.
Clustering approach assumes 'diversity of user parameter vectors' (spanning the parameter space).
Experimental validation of the mechanism design with real human strategic behavior is not detailed in the provided text.

📊 Experiments & Results

Evaluation Setup

The provided text contains only the Introduction, Related Works, and Preliminaries. Detailed experimental results, benchmarks, and quantitative metrics are not included in the source text.

Metrics:

Social Welfare
Sample Complexity

Statistical methodology: Not reported in the provided text

Main Takeaways

The paper theoretically establishes that pooling data via representation learning allows for sample-efficient personalized reward learning, overcoming the sparsity of individual data.
It proves that specific aggregation rules for probabilistic opinions are equivalent to reward aggregation rules satisfying six pivotal social choice axioms (under Plackett-Luce).
It demonstrates that a properly designed cost mechanism can neutralize strategic behavior, making truthful reporting the dominant strategy (DSIC) while maximizing social welfare.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Social Choice Theory (Arrow's Impossibility Theorem)
Mechanism Design (Vickrey-Clarke-Groves mechanisms)
Representation Learning (Low-rank MDPs/functions)

Key Terms

RLHF: Reinforcement Learning from Human Feedback, a method for fine-tuning AI models using human preference data

Mechanism Design: A field of economics and game theory concerned with designing the rules of a game so that strategic players produce a desired outcome (such as truthful reporting)

Social Choice Theory: A theoretical framework for analyzing how to combine individual opinions/preferences into a collective decision

Probabilistic Opinion: A feedback format where users assign a probability distribution over answers (indicating intensity of preference) rather than just selecting one

DSIC: Dominant-Strategy Incentive-Compatible, a property of a mechanism under which telling the truth is always a best strategy for each participant, regardless of what others do

Social Welfare: A measure of the overall well-being or satisfaction of a group, often defined as the sum of individual utilities

Sample Complexity: The number of training samples required to learn a model to a desired level of accuracy

Homogeneous vs. Heterogeneous: A homogeneous model assumes all users share a single preference function; a heterogeneous model allows each user (or group of users) to have a distinct one