Youngbin Choi, Seunghyuk Cho, Minjong Lee, MoonJeong Park, Yesong Ko, Jungseul Ok, Dongwoo Kim
Pohang University of Science and Technology
arXiv (2025)
P13N, RL, Recommendation
📝 Paper Summary
Personalized Reward Modeling, Preference Alignment
CoPL personalizes LLM reward models by constructing a user-response graph to propagate preference signals via message passing and dynamically routing inputs to user-specific LoRA experts.
Core Problem
Existing personalized reward models struggle with sparse annotations: latent-variable methods cannot align similar users when the sets of responses they annotated are disjoint.
Why it matters:
Users often provide very few preference labels, making it difficult to learn accurate individual profiles without leveraging similarities to others
Standard methods map users with identical preferences to distant embedding points if they rate different items, preventing effective generalization
Current approaches either rely on inflexible pre-defined clusters or fail to capture controversial preferences where user groups disagree
Concrete Example: User 1 prefers A > B; User 2 prefers C > D. Even if they share the same underlying taste (e.g., a preference for brevity), a standard latent model sees no connection between them because their annotated items {A, B} and {C, D} are disjoint. CoPL connects them via a graph (e.g., User 1 -> Item X -> User 3 -> Item Y -> User 2) to infer similarity.
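The multi-hop connection in this example can be sketched with a breadth-first search over a toy user-response graph. This is a minimal illustration, not CoPL's implementation: the `annotations` data, the third user, and the helper names `build_bipartite_graph` and `shortest_path` are hypothetical, and responses B and C play the roles of Item X and Item Y.

```python
from collections import deque

# Hypothetical toy annotations: each user maps to the responses they have rated.
# user1 and user2 share no responses; user3 overlaps with both.
annotations = {
    "user1": {"A", "B"},
    "user2": {"C", "D"},
    "user3": {"B", "C"},
}

def build_bipartite_graph(annotations):
    """Return an undirected adjacency map over ("user", id) and ("response", id) nodes."""
    graph = {}
    for user, responses in annotations.items():
        for response in responses:
            graph.setdefault(("user", user), set()).add(("response", response))
            graph.setdefault(("response", response), set()).add(("user", user))
    return graph

def shortest_path(graph, source, target):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

graph = build_bipartite_graph(annotations)
path = shortest_path(graph, ("user", "user1"), ("user", "user2"))
print(" -> ".join(name for _, name in path))
```

Removing `user3` from the toy data leaves no path between the two users, which is the situation in which a graph-based method has nothing to propagate.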
Key Novelty
Graph-based Collaborative Preference Learning with Mixture of LoRA Experts
Constructs a bipartite graph (Users <-> Responses) where edges represent preferences, allowing signals to propagate through multi-hop connections even when direct overlap is missing
Uses the learned graph embeddings to gate a Mixture of LoRA Experts (MoLE), dynamically adapting the reward model's parameters to specific user profiles during inference
Enables optimization-free adaptation for new users by aggregating embeddings from similar existing users found via graph neighbors, avoiding the need for retraining
Architecture
The overall training pipeline of CoPL. Left: User-Response Bipartite Graph Construction. Middle: GNN Message Passing to learn embeddings. Right: MoLE-based Reward Model using user embeddings for gating.
Evaluation Highlights
Consistently outperforms personalized baselines (I2E, VPL, PAL) on both seen and unseen users across TL;DR, UltraFeedback-P, and PersonalLLM datasets
Achieves performance comparable to a Group-Oracle (which has perfect knowledge of user groups) on controversial pairs where user preferences diverge
Maintains robustness in sparse settings (e.g., only 8 annotations) where baseline performance typically degrades
Breakthrough Assessment
7/10
Cleverly integrates collaborative filtering (common in recommender systems) into LLM alignment and effectively addresses the realistic sparse-data problem. The optimization-free adaptation to new users is a practical strength.
⚙️ Technical Details
Problem Definition
Setting: Personalized Reward Modeling using Pairwise Preferences
Inputs: User ID u, Question q, Two candidate responses (a, b)
Outputs: Scalar preference score f(u, r) indicating how much user u likes response r
Pipeline Flow
Graph Construction: Build User-Response Bipartite Graph from preference data
User Representation Learning: GNN Message Passing updates user embeddings
Personalized Reward Prediction: LLM with MoLE uses user embedding to gate experts and predict score
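The first two pipeline stages can be illustrated with a toy message-passing loop. This is a sketch under stated assumptions, not the paper's GNN: the signed mean aggregation, the one-hot initial embeddings, the edge list, and the names `propagate` and `cosine` are all hypothetical choices made for illustration.

```python
import math

# Hypothetical signed edges: (user, response, sign); +1 = preferred, -1 = rejected.
# u1 and u2 share no responses; u3 overlaps with each of them on one response.
edges = [
    ("u1", "A", +1), ("u1", "B", -1),
    ("u2", "C", +1), ("u2", "D", -1),
    ("u3", "A", +1), ("u3", "D", -1),
]

def one_hot_embeddings(edges):
    """Give every node (user or response) its own one-hot starting vector."""
    nodes = sorted({u for u, _, _ in edges} | {r for _, r, _ in edges})
    return {n: [1.0 if i == j else 0.0 for j in range(len(nodes))]
            for i, n in enumerate(nodes)}

def propagate(edges, embeddings, num_layers):
    """Signed mean aggregation over neighbors, applied synchronously per layer."""
    neighbors = {}
    for user, response, sign in edges:
        neighbors.setdefault(user, []).append((response, sign))
        neighbors.setdefault(response, []).append((user, sign))
    current = embeddings
    for _ in range(num_layers):
        updated = {}
        for node, adjacent in neighbors.items():
            dim = len(current[node])
            total = [0.0] * dim
            for other, sign in adjacent:
                for k in range(dim):
                    total[k] += sign * current[other][k]
            updated[node] = [value / len(adjacent) for value in total]
        current = updated
    return current

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

initial = one_hot_embeddings(edges)
for layers in (1, 2):
    final = propagate(edges, initial, layers)
    print(layers, round(cosine(final["u1"], final["u2"]), 3))
```

In this toy graph, one layer leaves `u1` and `u2` orthogonal because they share no responses, while a second layer gives them a positive cosine similarity through the intermediate user `u3`.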
System Modules
Graph Encoder
Learn user embeddings by propagating preference signals across the bipartite graph
Model or implementation: GNN (Message Passing)
Gating Mechanism (Reward Modeling)
Determine which LoRA experts to activate for the current user
Model or implementation: MLP (2-layer)
Expert Layers (Reward Modeling)
Apply user-specific parameter adaptations to the LLM backbone
Model or implementation: Mixture of LoRA Experts (MoLE)
Reward Head (Reward Modeling)
Output final scalar reward
Model or implementation: Linear Head on LLM backbone
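How the gating and expert modules fit together can be sketched for a single linear layer. This is an illustrative toy only: the dense softmax gate, the ReLU in the gating MLP, the layer sizes, the random initialization, and the class name `MoLELinear` are assumptions rather than details confirmed by the source.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

class MoLELinear:
    """Frozen linear layer plus a user-gated mixture of LoRA experts (toy sketch)."""

    def __init__(self, d_in, d_out, d_user, num_experts=4, rank=2, hidden=8, seed=0):
        rng = random.Random(seed)
        self.weight = random_matrix(d_out, d_in, rng)            # frozen backbone weight
        self.gate_w1 = random_matrix(hidden, d_user, rng)        # gating MLP, layer 1
        self.gate_w2 = random_matrix(num_experts, hidden, rng)   # gating MLP, layer 2
        # Each expert is a low-rank pair (A: rank x d_in, B: d_out x rank).
        # Both are random here for illustration; LoRA usually starts B at zero.
        self.experts = [
            (random_matrix(rank, d_in, rng), random_matrix(d_out, rank, rng))
            for _ in range(num_experts)
        ]

    def gate(self, user_embedding):
        """Map a user embedding to mixture weights over the experts."""
        hidden = [max(0.0, h) for h in matvec(self.gate_w1, user_embedding)]  # ReLU
        return softmax(matvec(self.gate_w2, hidden))

    def forward(self, x, user_embedding):
        weights = self.gate(user_embedding)
        output = matvec(self.weight, x)
        for g, (a, b) in zip(weights, self.experts):
            delta = matvec(b, matvec(a, x))  # low-rank update B (A x)
            output = [o + g * d for o, d in zip(output, delta)]
        return output, weights

layer = MoLELinear(d_in=4, d_out=3, d_user=5)
x = [0.5, -1.0, 0.25, 2.0]
out_a, gate_a = layer.forward(x, [1.0, 0.0, 0.0, 0.0, 0.0])
out_b, gate_b = layer.forward(x, [0.0, 0.0, 0.0, 0.0, 1.0])
print(gate_a)
print(gate_b)
```

The same input `x` yields different outputs for different user embeddings only through the gate, so the backbone weight stays shared across users while the low-rank updates are mixed per user.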
Novel Architectural Elements
Integration of GNN-based collaborative filtering embeddings directly into an LLM MoLE gating mechanism
User-response bipartite graph construction specifically for preference propagation in reward modeling
Modeling
Base Model: gemma-2b-it and gemma-7b-it
Training Method: Supervised Fine-Tuning (Reward Modeling) with BTL Loss
Objective Functions:
Purpose: Optimize user and response embeddings to reflect observed preferences in the graph.
Formally: Binary Cross Entropy on the inner product of user and response embeddings (Eq. 5)
Purpose: Train the LLM reward model to assign higher scores to preferred responses.
Formally: Negative Log Likelihood of the BTL model: -log(sigmoid(f(u,a) - f(u,b))) for pairs where a > b
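Both objectives reduce to a log-sigmoid of a scalar, which a few lines of plain Python can illustrate. The BTL term follows the formula stated above; the per-edge labeling used in `edge_bce` (1 for a preferred response, 0 for a rejected one) is an assumption about how the graph loss is set up, and the function names are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def btl_nll(reward_preferred, reward_rejected):
    """-log(sigmoid(f(u, a) - f(u, b))) for a pair where a is preferred over b."""
    return -log_sigmoid(reward_preferred - reward_rejected)

def edge_bce(user_embedding, response_embedding, label):
    """Binary cross entropy on the inner product of two embeddings.

    label is assumed to be 1 if the user preferred the response, else 0.
    """
    logit = sum(u * r for u, r in zip(user_embedding, response_embedding))
    # log(1 - sigmoid(z)) equals log(sigmoid(-z)).
    return -(label * log_sigmoid(logit) + (1 - label) * log_sigmoid(-logit))

# A correctly ordered pair is penalized less than a reversed one.
print(btl_nll(2.0, 0.0), btl_nll(0.0, 2.0))
# An aligned user-response pair labeled as preferred has a low loss.
print(edge_bce([1.0, 0.5], [2.0, 1.0], 1))
```

When the two rewards are equal, the BTL loss is log 2, which corresponds to the model assigning a 50% probability to either ordering.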
Comparison with Prior Methods
vs. I2E/PAL: CoPL uses GNNs to leverage collaborative signals (user-user similarity) rather than treating users independently or as isolated parameters
vs. VPL: CoPL handles sparse data better by propagating signals through the graph, whereas VPL struggles when user interactions are too few to form a good latent prior
vs. Multi-Model approaches: CoPL uses a single model with dynamic routing (MoLE) rather than maintaining separate models for different preference types
Limitations
Depends on graph connectivity: performance may degrade if the graph is highly fragmented, with few or no responses shared across users
Majority bias observed in imbalanced group settings (1:9 ratio), requiring mitigation strategies like focal loss
Computational overhead of GNN message passing during the user representation learning phase
Code is publicly available at https://github.com/ml-postech/CoPL. Datasets (TL;DR, UltraFeedback-P, PersonalLLM) are standard or described in prior work.
📊 Experiments & Results
Evaluation Setup
Reward Model Accuracy on held-out preference pairs for Seen and Unseen users
Benchmarks:
TL;DR: Summarization (short vs. long preference groups)
UltraFeedback-P (UF-P): General instruction following (constructed with 2 or 4 preference groups)
PersonalLLM: Personalized text generation (mixture of 4 preference dimensions)
Metrics:
Pairwise Accuracy
Statistical methodology: Experiments repeated three times with different seeds
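The metric and the three-seed protocol can be sketched as follows. The scores are made-up placeholders rather than results from the paper, counting ties as errors is an assumption, and `pairwise_accuracy` is a hypothetical helper name.

```python
from statistics import mean, stdev

def pairwise_accuracy(scored_pairs):
    """Fraction of pairs where the preferred response gets the higher reward.

    scored_pairs: iterable of (score_preferred, score_rejected).
    Ties are counted as incorrect (an assumption of this sketch).
    """
    scored_pairs = list(scored_pairs)
    correct = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return correct / len(scored_pairs)

# Placeholder reward scores for three hypothetical seeds (not real results).
runs = [
    [(1.2, 0.3), (0.1, 0.4), (2.0, 1.5), (0.7, 0.2)],
    [(0.9, 0.1), (0.6, 0.5), (1.1, 1.3), (0.2, 0.8)],
    [(1.0, 0.0), (0.3, 0.2), (0.5, 0.4), (0.9, 0.1)],
]
accuracies = [pairwise_accuracy(run) for run in runs]
print(f"accuracy = {mean(accuracies):.3f} +/- {stdev(accuracies):.3f}")
```

Reporting the sample standard deviation alongside the mean makes it easier to judge whether a gap between two methods exceeds seed-to-seed variation.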
Experiment Figures
t-SNE visualization of learned user embeddings comparing CoPL with baselines in a sparse annotation setting.
Heatmap of expert allocation across different user groups in the MoLE layers.
Main Takeaways
CoPL consistently outperforms baselines (I2E, VPL, PAL, Uniform) across all datasets for both seen and unseen users, often matching the oracle model that has access to ground-truth groups.
In sparse settings (e.g., PersonalLLM with average annotations), CoPL remains robust while baselines like VPL degrade, validating the benefit of collaborative signal propagation.
On 'controversial' pairs (where user groups disagree), CoPL significantly outperforms baselines, which tend to collapse to the average/majority preference.
The MoLE architecture is efficient: CoPL uses fewer activated parameters during inference (Rank 8 expert) compared to baselines using a single large LoRA (Rank 64), yet achieves higher accuracy.
Unseen user adaptation works effectively: aggregating embeddings from 2-hop neighbors allows the model to predict preferences for new users without any gradient updates.
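The optimization-free adaptation described in the last takeaway can be sketched as a weighted average over existing users reached in two hops. The agreement-count weighting, the data layout, and the name `embed_unseen_user` are assumptions for illustration; the source states only that embeddings of 2-hop neighbors are aggregated without gradient updates.

```python
def embed_unseen_user(new_labels, seen_labels, seen_embeddings):
    """Build an embedding for a new user without any gradient updates.

    new_labels: {response_id: +1 or -1} given by the unseen user.
    seen_labels: {user_id: {response_id: +1 or -1}} for existing users.
    seen_embeddings: {user_id: [float, ...]} learned earlier by the graph encoder.
    Returns None when no existing user agrees on any shared response.
    """
    weights = {}
    for user, labels in seen_labels.items():
        shared = set(labels) & set(new_labels)  # responses linking the two users
        agreement = sum(1 for r in shared if labels[r] == new_labels[r])
        if agreement > 0:
            weights[user] = agreement
    if not weights:
        return None
    total = sum(weights.values())
    dim = len(next(iter(seen_embeddings.values())))
    return [
        sum(w * seen_embeddings[u][k] for u, w in weights.items()) / total
        for k in range(dim)
    ]

# Hypothetical existing users, their labels, and their learned embeddings.
seen_labels = {
    "alice": {"A": +1, "B": -1},
    "bob": {"A": -1, "B": +1},
    "carol": {"A": +1, "C": +1},
}
seen_embeddings = {
    "alice": [1.0, 0.0],
    "bob": [-1.0, 0.0],
    "carol": [0.5, 0.5],
}
new_user_embedding = embed_unseen_user({"A": +1, "B": -1}, seen_labels, seen_embeddings)
print(new_user_embedding)
```

Here the new user agrees with `alice` on two responses and with `carol` on one, so the resulting embedding sits closer to `alice`, while `bob`, who disagrees on both shared responses, contributes nothing.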
📚 Prerequisite Knowledge
Prerequisites
Bradley-Terry-Luce (BTL) Model
Graph Neural Networks (GNNs) / Message Passing
Low-Rank Adaptation (LoRA)
Mixture of Experts (MoE)
Collaborative Filtering
Key Terms
BTL: Bradley-Terry-Luce—a probabilistic model that predicts the likelihood of one item being preferred over another based on their reward scores
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model and trains small decomposition matrices
MoLE: Mixture of LoRA Experts—an architecture where multiple LoRA adapters serve as 'experts', and a gating mechanism selects which ones to use for a given input
Bipartite Graph: A graph with two distinct sets of nodes (here, Users and Responses), where edges only connect nodes from different sets
Message Passing: A GNN mechanism where nodes update their embeddings by aggregating information from their neighbors
Sparse Annotation: A setting where each user provides very few preference labels (e.g., 8-16 pairs) relative to the total number of items