PEFT-U: Parameter-Efficient Fine-Tuning for User Personalization

📝 Paper Summary

Personalized LLMs · Subjective NLP Tasks · Parameter-Efficient Fine-Tuning (PEFT)

The paper introduces PEFT-U, a benchmark for subjective tasks where users disagree on identical inputs, and demonstrates that parameter-efficient fine-tuning outperforms prompting for modeling these individual perspectives.

Core Problem

LLMs typically take a 'one-size-fits-all' approach that aggregates user data into a single ground truth. This fails on subjective tasks, where different users can validly assign conflicting labels to the exact same input.

Why it matters:

Subjective applications like Hate Speech Detection and Humor Analysis depend entirely on individual user perspective, which generalized models ignore by favoring majority voting
Deploying separate full fine-tuned models for every user is computationally prohibitive in production environments
Existing benchmarks usually discard disagreement as 'noise', thereby removing the very signals needed to train personalized models

Concrete Example: In the HateXplain dataset, one user might label the phrase 'right definitely not going back to the fag hag thing' as 'normal', while another user labels the exact same text as 'offensive'. A standard LLM trained on majority vote would force a single label, ignoring the specific user's context.

Key Novelty

PEFT-U Benchmark & Evaluation Framework

Reconstructs 13+ NLP datasets by treating individual annotators as distinct users, specifically filtering for tasks with low inter-annotator agreement (Krippendorff’s alpha ≤ 0.5) to ensure personalization is required
Comparative analysis of 'Parametric' personalization (updating specific weights via Adapters/LoRA per user) versus 'Non-Parametric' personalization (prompting with user examples)
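The agreement filter can be illustrated with a minimal sketch of Krippendorff's alpha for nominal labels. This is the textbook coincidence-matrix formulation, not the authors' code; the function names and the `threshold` parameter are hypothetical, and only the 0.5 cutoff comes from the summary above.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: list of lists; each inner list holds the labels given to one item."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        # Every ordered pair of ratings on the same item, weighted by 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("need at least one item with two or more ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label value was ever used: alpha is undefined
    return 1.0 - observed / expected


def needs_personalization(units, threshold=0.5):
    """Hypothetical helper: keep a dataset only if agreement is at or below the cutoff."""
    return krippendorff_alpha_nominal(units) <= threshold
```

An alpha of 1 means perfect agreement, values near 0 mean agreement no better than chance, and negative values mean systematic disagreement.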

Evaluation Highlights

Adapters achieved the highest overall accuracy of 64.4% across 13 personalized tasks, outperforming LoRA (59.5%)
Adapters outperformed other methods on 12 out of the 13 PEFT-U tasks
Personalized fine-tuning methods consistently outperformed Zero-shot and Few-shot prompting baselines on average

Breakthrough Assessment

7/10

Significant contribution in benchmarking subjective tasks where 'ground truth' varies by user. The finding that Adapters outperform LoRA in this specific setting is a useful empirical insight for personalization.

⚙️ Technical Details

Problem Definition

Setting: Personalized classification on subjective tasks

Inputs: Input text x and User ID/Context u

Outputs: User-specific label y_u (which may differ from y_v for user v on the same x)
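As a rough illustration of this setting, the sketch below contrasts majority-vote aggregation with per-user targets. The toy records, user IDs, and function names are hypothetical and are not taken from the benchmark.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (text, user_id, label) triples for one shared input.
annotations = [
    ("example post", "user_a", "normal"),
    ("example post", "user_b", "offensive"),
    ("example post", "user_c", "offensive"),
]


def majority_vote(records):
    """Collapse all users into one label per text (the 'one-size-fits-all' view)."""
    by_text = defaultdict(list)
    for text, _user, label in records:
        by_text[text].append(label)
    return {text: Counter(labels).most_common(1)[0][0] for text, labels in by_text.items()}


def per_user_targets(records):
    """Keep one target per (text, user) pair so disagreement is preserved."""
    return {(text, user): label for text, user, label in records}
```

Under majority voting, `user_a`'s label disappears; the per-user mapping keeps it as a separate training target y_u.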

Pipeline Flow

User Data Selection (keep users with at least 10 samples; keep low-agreement tasks)
Prompt Construction (Instruction-style prompts per dataset)
Model Processing (Flan-T5 + User-Specific PEFT Modules)
Output Generation (Classification Label)

System Modules

Base LLM

Frozen language model providing general language understanding

Model or implementation: Flan-T5 (Base, Large, XL variants implied by PEFT context)

PEFT Module

Learns user-specific or task-specific adjustments to the base model representations

Model or implementation: Varied: LoRA, Adapters, IA3, Prompt Tuning, Prefix-Tuning, P-Tuning

Modeling

Base Model: Flan-T5

Training Method: Supervised Fine-Tuning of PEFT parameters

Adaptation: Compares LoRA, Adapters (Houlsby), Prompt Tuning, Prefix-Tuning, P-Tuning, and IA3

Training Data:

13+ datasets transformed into user-centric format
Users with <10 samples discarded
80/10/10 split (Train/Dev/Test) per user
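The per-user filtering and splitting described above might look like the following sketch. The record layout, the shuffling, the seed, and the integer rounding rule are assumptions; the summary only states the 10-sample minimum and the 80/10/10 ratio.

```python
import random
from collections import defaultdict


def split_per_user(records, min_samples=10, seed=0):
    """records: iterable of (user_id, text, label). Returns {user_id: {split: samples}}."""
    by_user = defaultdict(list)
    for user, text, label in records:
        by_user[user].append((text, label))

    rng = random.Random(seed)
    splits = {}
    for user, samples in by_user.items():
        if len(samples) < min_samples:
            continue  # users with fewer than 10 samples are discarded
        samples = samples[:]
        rng.shuffle(samples)
        n_train = len(samples) * 8 // 10  # 80% train (rounded down, an assumption)
        n_dev = len(samples) // 10        # 10% dev; the remainder goes to test
        splits[user] = {
            "train": samples[:n_train],
            "dev": samples[n_train:n_train + n_dev],
            "test": samples[n_train + n_dev:],
        }
    return splits
```

Splitting inside each user, rather than across the pooled data, keeps every retained user represented in all three partitions.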

Key Hyperparameters:

optimizer: AdamW
weight_decay: 0.01
learning_rate: 2e-5
batch_size: 16
epochs: 8
scheduler: Cosine schedule (linear warmup for first 10% steps)
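A cosine schedule with linear warmup can be sketched as a plain function of the step index. This assumes the rate decays to zero over a single half-cosine after warmup, which is a common convention but is not confirmed by the summary; `lr_at_step` and its parameter names are hypothetical.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_fraction=0.1):
    """Learning rate at a given optimizer step (assumed decay-to-zero cosine)."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the first 10% of steps.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1,000 total steps, the rate climbs for the first 100 steps, peaks at 2e-5, and then falls smoothly toward zero.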

Compute: NVIDIA RTX 3090 24GB GPUs

Comparison to Prior Work

vs. Salemi et al.: PEFT-U explicitly enforces conflicting user perspectives (Krippendorff’s alpha ≤ 0.5) to test handling of identical inputs with different labels
vs. Standard NLP Resources: PEFT-U treats annotators as distinct users rather than discarding outliers to find a 'majority vote' ground truth

Limitations

User perspective is modeled solely based on output preferences, ignoring demographics (age, gender, etc.)
Benchmark may not fully represent all intricacies of real-world communication scenarios
Evaluation is limited to classification tasks, not generation
Requires at least 10 samples per user, which may not be available in cold-start scenarios

Reproducibility

Code: https://github.com/ChrisIsKing/Parameter-Efficient-Personalization

Code, models, and benchmark are publicly released on GitHub. Datasets are reconstructed from existing public datasets (HateXplain, TweetEval, etc.) using specified filtering criteria.

📊 Experiments & Results

Evaluation Setup

Multi-task user-specific classification

Benchmarks:

PEFT-U Benchmark (personalized classification: hate speech, sentiment, humor) [New]

Metrics:

Average per-user accuracy per task
Average accuracy across all tasks
Statistical methodology: Multiple runs with varied random seeds (reported in the methodology, though specific confidence intervals are not given in the text)
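The two accuracy metrics can be read as a macro-average over users followed by a macro-average over tasks. The sketch below follows that reading; the input layout and function names are hypothetical, and the exact aggregation used by the authors is an assumption.

```python
from statistics import mean


def task_accuracy(per_user_results):
    """per_user_results: {user_id: [(prediction, gold), ...]} for one task.

    Each user counts equally, regardless of how many samples they contributed.
    """
    user_scores = [
        mean(1.0 if pred == gold else 0.0 for pred, gold in pairs)
        for pairs in per_user_results.values()
        if pairs
    ]
    return mean(user_scores)


def benchmark_accuracy(results_by_task):
    """results_by_task: {task_name: per_user_results}. Each task counts equally."""
    return mean(task_accuracy(results) for results in results_by_task.values())
```

Averaging per user first prevents prolific annotators from dominating a task's score, which pooled sample-level accuracy would allow.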

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
PEFT-U (All 13 Tasks)	Average Accuracy (%)	59.5 (LoRA)	64.4 (Adapters)	+4.9
PEFT-U (Task Count)	Number of tasks won (of 13)	1 (other methods)	12 (Adapters)	+11

Main Takeaways

Personalized fine-tuning methods (Adapters, LoRA) outperform zero-shot and few-shot prompting on average, suggesting that prompt context alone is insufficient for capturing complex user subjectivity.
Adapters (bottleneck layers) generally outperform LoRA (rank decomposition) in this specific benchmark setting, achieving the highest accuracy on 12/13 tasks.
Performance varies significantly across tasks (e.g., Subjective Discourse vs. MeasuringHateSpeech), indicating the benchmark presents a multifaceted challenge.
Parameter efficiency trade-off: While Adapters perform best overall, LoRA can outperform Adapters if the number of trainable parameters is equalized (as shown in the TweetEval ablation).
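To make the parameter-equalization point concrete, the sketch below counts trainable parameters using the standard formulas: r(d_in + d_out) for a rank-r LoRA update, and 2md + d + m for a bottleneck adapter of size m with biases. The dimensions, the number of modules per layer, and the helper names are illustrative assumptions, not values from the paper.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters for one LoRA update: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_params(d_model, bottleneck):
    """Trainable parameters for one bottleneck adapter: down- and up-projection plus biases."""
    return 2 * d_model * bottleneck + d_model + bottleneck


def matching_lora_rank(d_model, bottleneck, n_adapters, n_lora_matrices):
    """Largest LoRA rank whose per-layer budget does not exceed the adapter budget."""
    budget = n_adapters * adapter_params(d_model, bottleneck)
    cost_per_rank = n_lora_matrices * lora_params(d_model, d_model, 1)
    return budget // cost_per_rank
```

For example, with a hypothetical hidden size of 768, two adapters of bottleneck size 64 per layer use about as many parameters as rank-64 LoRA on two square weight matrices, far more than the small ranks LoRA is often run with.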

📚 Prerequisite Knowledge

Prerequisites

Parameter-Efficient Fine-Tuning (PEFT) architectures
Inter-annotator agreement metrics
Subjective NLP tasks (Hate speech, Sentiment)

Key Terms

PEFT: Parameter-Efficient Fine-Tuning—methods that fine-tune a small number of extra parameters while freezing the base pre-trained model to save compute

LoRA: Low-Rank Adaptation—a PEFT method that injects trainable rank decomposition matrices into transformer layers

Adapters: A PEFT method (Houlsby et al.) that adds trainable bottleneck layers after the attention and feed-forward sublayers in transformer layers

Krippendorff’s alpha: A statistical measure of the agreement achieved when coding a set of units of analysis; low values indicate high disagreement/subjectivity

Prompt Tuning: A method that learns continuous soft-prompt embeddings prepended to the input text

Subjective Discourse: A dataset collecting subjective interpretations of question-response pairs in congressional hearings

Rasch measurement theory: A psychometric model used in the 'MeasuringHateSpeech' dataset to adjust for annotator perspectives