CoPL: Collaborative Preference Learning for Personalizing LLMs

📝 Paper Summary

Personalized Reward Modeling, Preference Alignment

CoPL personalizes LLM reward models by constructing a user-response graph to propagate preference signals via message passing and dynamically routing inputs to user-specific LoRA experts.

Core Problem

Existing personalized reward models struggle with sparse annotations: latent-variable methods cannot align similar users when the sets of responses those users annotated are disjoint.

Why it matters:

Users often provide very few preference labels, making it difficult to learn accurate individual profiles without leveraging similarities to others
Standard methods map users with identical preferences to distant embedding points if they rate different items, preventing effective generalization
Current approaches either rely on inflexible pre-defined clusters or fail to capture controversial preferences where user groups disagree

Concrete Example: User 1 prefers A>B; User 2 prefers C>D. Even if they share the same underlying taste (e.g., a preference for brevity), a standard latent model sees no connection between them because their annotated items {A,B} and {C,D} are disjoint. CoPL links them through a multi-hop path in the graph (e.g., User 1 -> Item X -> User 3 -> Item Y -> User 2, where User 3 has also annotated one of User 1's items, X, and one of User 2's items, Y) and uses that path to infer their similarity.
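The multi-hop idea in this example can be checked with a small breadth-first search. This is only an illustrative sketch: the toy edge list, the node names, and the choice of X = B and Y = C are assumptions made here, and `hop_distance` is a hypothetical helper, not part of CoPL.

```python
from collections import deque

def hop_distance(edges, source, target):
    """Shortest path length (in edges) between two nodes of an undirected
    user-response graph, or None if they are not connected."""
    adj = {}
    for user, response in edges:
        adj.setdefault(user, set()).add(response)
        adj.setdefault(response, set()).add(user)
    if source not in adj or target not in adj:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical toy data: user1 annotated {A, B}, user2 annotated {C, D},
# and user3 annotated B and C, bridging the two.
edges = [
    ("user1", "A"), ("user1", "B"),
    ("user2", "C"), ("user2", "D"),
    ("user3", "B"), ("user3", "C"),
]
items_1 = {r for u, r in edges if u == "user1"}
items_2 = {r for u, r in edges if u == "user2"}
direct_overlap = items_1 & items_2
print(direct_overlap)                          # set()
print(hop_distance(edges, "user1", "user2"))   # 4
```

The two users share no annotated response, yet they sit four hops apart (user1 -> B -> user3 -> C -> user2), which is the kind of path a message-passing model can exploit.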

Key Novelty

Graph-based Collaborative Preference Learning with Mixture of LoRA Experts

Constructs a bipartite graph (Users <-> Responses) where edges represent preferences, allowing signals to propagate through multi-hop connections even when direct overlap is missing
Uses the learned graph embeddings to gate a Mixture of LoRA Experts (MoLE), dynamically adapting the reward model's parameters to specific user profiles during inference
Enables optimization-free adaptation for new users by aggregating embeddings from similar existing users found via graph neighbors, avoiding the need for retraining

Architecture

The overall training pipeline of CoPL. Left: User-Response Bipartite Graph Construction. Middle: GNN Message Passing to learn embeddings. Right: MoLE-based Reward Model using user embeddings for gating.

Evaluation Highlights

Consistently outperforms personalized baselines (I2E, VPL, PAL) on both seen and unseen users across TL;DR, UltraFeedback-P, and PersonalLLM datasets
Achieves performance comparable to a Group-Oracle (which has perfect knowledge of user groups) on controversial pairs where user preferences diverge
Maintains robustness in sparse settings (e.g., only 8 annotations) where baseline performance typically degrades

Breakthrough Assessment

7/10

Clever integration of collaborative filtering (common in RecSys) into LLM alignment. Effectively addresses the realistic sparse-data problem. The optimization-free adaptation for new users is a practical strength.

⚙️ Technical Details

Problem Definition

Setting: Personalized Reward Modeling using Pairwise Preferences

Inputs: User ID u, Question q, Two candidate responses (a, b)

Outputs: Scalar preference score f(u, r) indicating how much user u likes response r

Pipeline Flow

Graph Construction: Build User-Response Bipartite Graph from preference data
User Representation Learning: GNN Message Passing updates user embeddings
Personalized Reward Prediction: LLM with MoLE uses user embedding to gate experts and predict score
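As a rough illustration of the first pipeline step, pairwise preference records can be flattened into user-response edges. The sketch below assumes a signed-edge encoding (+1 for the preferred response, -1 for the other one); the summary only states that edges represent preferences, so the encoding and the `build_edges` helper are assumptions.

```python
def build_edges(preferences):
    """Turn (user, chosen, rejected) triples into signed user-response edges.

    Assumption: +1 marks a preferred response, -1 a dispreferred one.
    """
    edges = []
    for user, chosen, rejected in preferences:
        edges.append((user, chosen, +1))
        edges.append((user, rejected, -1))
    return edges

# Hypothetical annotations; response IDs stand in for (question, response) text.
preferences = [("u1", "r1", "r2"), ("u2", "r1", "r3")]
edges = build_edges(preferences)

# The two node sets of the bipartite graph.
users = sorted({u for u, _, _ in edges})
responses = sorted({r for _, r, _ in edges})
print(users, responses)
print(edges)
```

Here `u1` and `u2` become connected through the shared response `r1`, even though each of them annotated only a single pair.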

System Modules

Graph Encoder

Learn user embeddings by propagating preference signals across the bipartite graph

Model or implementation: GNN (Message Passing)

Gating Mechanism (Reward Modeling)

Determine which LoRA experts to activate for the current user

Model or implementation: MLP (2-layer)

Expert Layers (Reward Modeling)

Apply user-specific parameter adaptations to the LLM backbone

Model or implementation: Mixture of LoRA Experts (MoLE)

Reward Head (Reward Modeling)

Output final scalar reward

Model or implementation: Linear Head on LLM backbone
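The gating, expert, and reward-head modules above could fit together roughly as follows. This is a sketch under several assumptions: a bias-free 2-layer MLP gate with ReLU, dense softmax weighting over all experts (the summary does not say whether routing is dense or top-k), a single adapted layer, and tiny hand-picked matrices. All function names are hypothetical.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def gate(user_emb, W1, W2):
    """2-layer MLP gate: ReLU hidden layer, softmax over experts."""
    hidden = [max(0.0, h) for h in matvec(W1, user_emb)]
    return softmax(matvec(W2, hidden))

def lora_delta(A, B, x):
    """Low-rank update B @ (A @ x); A is (rank x d_in), B is (d_out x rank)."""
    return matvec(B, matvec(A, x))

def mole_forward(x, W0, shared, experts, weights):
    """Frozen base output + shared LoRA + gate-weighted expert LoRAs."""
    out = matvec(W0, x)
    out = [o + d for o, d in zip(out, lora_delta(shared[0], shared[1], x))]
    for w, (A, B) in zip(weights, experts):
        out = [o + w * d for o, d in zip(out, lora_delta(A, B, x))]
    return out

def reward(x, user_emb, W0, shared, experts, W1, W2, head_w):
    """Scalar reward from a linear head on the adapted features."""
    weights = gate(user_emb, W1, W2)
    features = mole_forward(x, W0, shared, experts, weights)
    return sum(h * f for h, f in zip(head_w, features))

# Hypothetical toy setup: 2 features, 2 rank-1 experts, zero shared adapter.
W0 = [[1.0, 0.0], [0.0, 1.0]]
shared = ([[0.0, 0.0]], [[0.0], [0.0]])
experts = [([[1.0, 0.0]], [[1.0], [0.0]]), ([[0.0, 1.0]], [[0.0], [1.0]])]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[10.0, 0.0], [0.0, 10.0]]
head_w = [1.0, -1.0]
x = [1.0, 1.0]  # stands in for the backbone's hidden state of one response

weights_a = gate([1.0, 0.0], W1, W2)
reward_a = reward(x, [1.0, 0.0], W0, shared, experts, W1, W2, head_w)
reward_b = reward(x, [0.0, 1.0], W0, shared, experts, W1, W2, head_w)
print(reward_a > 0, reward_b < 0)  # True True
```

The same response features receive opposite-signed rewards for the two user embeddings because the gate routes each user to a different expert, which is the behavior personalization needs on controversial pairs.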

Novel Architectural Elements

Integration of GNN-based collaborative filtering embeddings directly into an LLM MoLE gating mechanism
User-response bipartite graph construction specifically for preference propagation in reward modeling

Modeling

Base Model: gemma-2b-it and gemma-7b-it

Training Method: Supervised Fine-Tuning (Reward Modeling) with BTL Loss

Objective Functions:

Purpose: Optimize user and response embeddings to reflect observed preferences in the graph.

Formally: Binary Cross Entropy on the inner product of user and response embeddings (Eq. 5)

Purpose: Train the LLM reward model to assign higher scores to preferred responses.

Formally: Negative Log Likelihood of the BTL model: -log(sigmoid(f(u,a) - f(u,b))) for pairs where a > b
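Both objectives can be written in a numerically stable form with a softplus helper. The sketch below assumes scalar scores and a 0/1 edge label for the graph loss; the exact form of Eq. 5 is not reproduced in this summary, so `edge_bce_loss` is a generic binary cross entropy on the inner product rather than the paper's exact equation.

```python
import math

def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def btl_loss(score_chosen, score_rejected):
    """-log(sigmoid(f(u,a) - f(u,b))) for a pair where a is preferred to b."""
    return softplus(-(score_chosen - score_rejected))

def edge_bce_loss(user_emb, resp_emb, label):
    """Binary cross entropy on sigmoid(<e_u, e_r>).

    Assumption: label is 1 for a preferred response and 0 otherwise.
    Uses the identity BCE(z, y) = softplus(z) - y * z.
    """
    logit = sum(u * r for u, r in zip(user_emb, resp_emb))
    return softplus(logit) - label * logit

print(btl_loss(2.0, 0.0) < btl_loss(0.0, 2.0))  # True
print(btl_loss(1000.0, 0.0))                    # 0.0
```

When the two scores are equal the BTL loss is log 2, and it shrinks toward zero as the preferred response's score pulls ahead.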

Adaptation: Mixture of LoRA Experts (1 shared + 8 experts, rank=8)

Trainable Parameters: GNN weights, User/Response Embeddings, LoRA Experts, Gating Network

Key Hyperparameters:

lora_rank: 8 (per expert)
num_experts: 8
gating_mlp_layers: 2
baseline_lora_rank: 64 (for comparison models)
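For a sense of scale, the per-matrix LoRA parameter counts implied by these ranks can be worked out directly. The hidden size of 2048 is a hypothetical placeholder, and the "one routed expert plus the shared adapter" case is an assumption for illustration; the summary does not state how many experts are active per input.

```python
def lora_params(d_in, d_out, rank):
    """A LoRA adapter on a (d_out x d_in) weight adds rank * (d_in + d_out) parameters."""
    return rank * (d_in + d_out)

d = 2048  # hypothetical hidden size, for illustration only

per_expert = lora_params(d, d, rank=8)       # one rank-8 adapter
baseline = lora_params(d, d, rank=64)        # single rank-64 adapter
stored = 9 * per_expert                      # 1 shared + 8 experts kept in memory
active_one_routed = 2 * per_expert           # shared + one routed expert (assumed)

print(per_expert, baseline, stored, active_one_routed)
# 32768 262144 294912 65536
```

Under these assumptions the mixture stores slightly more adapter parameters than the rank-64 baseline, while each forward pass would touch only a fraction of them.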

Compute: Not reported in the paper

Comparison to Prior Work

vs. I2E/PAL: CoPL uses GNNs to leverage collaborative signals (user-user similarity) rather than treating users independently or as isolated parameters
vs. VPL: CoPL handles sparse data better by propagating signals through the graph, whereas VPL struggles when user interactions are too few to form a good latent prior
vs. Multi-Model approaches: CoPL uses a single model with dynamic routing (MoLE) rather than maintaining separate models for different preference types

Limitations

Dependency on graph connectivity; performance may degrade if the graph is extremely fragmented, with users sharing no annotated responses
Majority bias observed in imbalanced group settings (1:9 ratio), requiring mitigation strategies like focal loss
Computational overhead of GNN message passing during the user representation learning phase
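The focal-loss mitigation mentioned above can be illustrated on the pairwise objective. This is a generic focal variant of the BTL loss, written as an assumption about how such a mitigation might look; the summary does not say which loss or which focusing parameter the authors apply it to, and `gamma=2.0` is a commonly used default rather than a reported value.

```python
import math

def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def focal_pairwise_loss(score_chosen, score_rejected, gamma=2.0):
    """(1 - p)^gamma * -log(p), with p = sigmoid(score_chosen - score_rejected).

    gamma = 0 recovers the plain BTL negative log likelihood.
    """
    margin = score_chosen - score_rejected
    p_wrong = sigmoid(-margin)  # equals 1 - p
    return (p_wrong ** gamma) * softplus(-margin)

easy = focal_pairwise_loss(3.0, 0.0)  # pair the model already ranks correctly
hard = focal_pairwise_loss(0.0, 3.0)  # pair the model ranks incorrectly
print(easy < hard)  # True
```

Easy pairs, which majority-group users tend to produce in bulk, are down-weighted far more than hard pairs, so minority-group examples contribute relatively more to the gradient.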

Reproducibility

Code: https://github.com/ml-postech/CoPL

Datasets (TL;DR, UltraFeedback-P, PersonalLLM) are standard or described in prior work.

📊 Experiments & Results

Evaluation Setup

Reward Model Accuracy on held-out preference pairs for Seen and Unseen users

Benchmarks:

TL;DR: Summarization (short vs. long preference groups)
UltraFeedback-P (UF-P): General instruction following (constructed with 2 or 4 preference groups)
PersonalLLM: Personalized text generation (mixture of 4 preference dimensions)

Metrics:

Pairwise Accuracy
Statistical methodology: Experiments repeated three times with different seeds
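Pairwise accuracy is the fraction of held-out pairs on which the reward model scores the user's preferred response above the other one. A minimal sketch follows; the length-based reward function and the two user groups are hypothetical stand-ins for a trained model.

```python
def pairwise_accuracy(reward_fn, pairs):
    """pairs: iterable of (user, chosen, rejected). Ties count as errors."""
    pairs = list(pairs)
    correct = sum(1 for u, a, b in pairs if reward_fn(u, a) > reward_fn(u, b))
    return correct / len(pairs)

# Hypothetical reward: "short" users like brevity, "long" users like detail.
group = {"u1": "short", "u2": "long"}

def toy_reward(user, response):
    return -len(response) if group[user] == "short" else len(response)

held_out = [
    ("u1", "ok", "a much longer summary"),  # ranked correctly
    ("u2", "a much longer summary", "ok"),  # ranked correctly
    ("u2", "ok", "a much longer summary"),  # ranked incorrectly
]
accuracy = pairwise_accuracy(toy_reward, held_out)
print(round(accuracy, 3))  # 0.667
```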

Experiment Figures

t-SNE visualization of learned user embeddings comparing CoPL with baselines in a sparse annotation setting.

Heatmap of expert allocation across different user groups in the MoLE layers.

Main Takeaways

CoPL consistently outperforms baselines (I2E, VPL, PAL, Uniform) across all datasets for both seen and unseen users, often matching the oracle model that has access to ground-truth groups.
In sparse settings (e.g., PersonalLLM with average annotations), CoPL remains robust while baselines like VPL degrade, validating the benefit of collaborative signal propagation.
On 'controversial' pairs (where user groups disagree), CoPL significantly outperforms baselines, which tend to collapse to the average/majority preference.
The MoLE architecture is efficient: CoPL uses fewer activated parameters during inference (Rank 8 expert) compared to baselines using a single large LoRA (Rank 64), yet achieves higher accuracy.
Unseen user adaptation works effectively: aggregating embeddings from 2-hop neighbors allows the model to predict preferences for new users without any gradient updates.
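The optimization-free adaptation in the last takeaway can be sketched as a neighbor average. The version below assumes that a 2-hop neighbor is any existing user who gave the same label to at least one response the new user annotated, and that neighbors are averaged with equal weight; the paper's exact neighbor selection and weighting may differ, and `embed_unseen_user` is a hypothetical name.

```python
def embed_unseen_user(new_labels, existing_labels, user_emb):
    """Average the embeddings of existing users who agree with the new user.

    new_labels: {response: +1 or -1} for the unseen user.
    existing_labels: {user: {response: +1 or -1}} for trained users.
    Returns None if no existing user agrees on any shared response.
    """
    neighbors = [
        u for u, labels in existing_labels.items()
        if any(labels.get(r) == sign for r, sign in new_labels.items())
    ]
    if not neighbors:
        return None
    vectors = [user_emb[u] for u in neighbors]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical trained users and embeddings.
existing_labels = {
    "u1": {"r1": +1, "r2": -1},
    "u2": {"r1": -1, "r3": +1},  # disagrees with the new user on r1
    "u3": {"r2": -1},
}
user_emb = {"u1": [1.0, 0.0], "u2": [0.0, 1.0], "u3": [3.0, 0.0]}

new_user = embed_unseen_user({"r1": +1, "r2": -1}, existing_labels, user_emb)
print(new_user)  # [2.0, 0.0]
```

No gradient step is involved: the new embedding is computed from stored embeddings and can be passed straight to the gating network.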

📚 Prerequisite Knowledge

Prerequisites

Bradley-Terry-Luce (BTL) Model
Graph Neural Networks (GNNs) / Message Passing
Low-Rank Adaptation (LoRA)
Mixture of Experts (MoE)
Collaborative Filtering

Key Terms

BTL: Bradley-Terry-Luce—a probabilistic model that predicts the likelihood of one item being preferred over another based on their reward scores

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model and trains small decomposition matrices

MoLE: Mixture of LoRA Experts—an architecture where multiple LoRA adapters serve as 'experts', and a gating mechanism selects which ones to use for a given input

Bipartite Graph: A graph with two distinct sets of nodes (here, Users and Responses), where edges only connect nodes from different sets

Message Passing: A GNN mechanism where nodes update their embeddings by aggregating information from their neighbors

Sparse Annotation: A setting where each user provides very few preference labels (e.g., 8-16 pairs) relative to the total number of items