Personalized Large Language Models

📝 Paper Summary

Subjective text perception, User-specific fine-tuning

The paper demonstrates that fine-tuning LLMs with simple user identifiers significantly outperforms zero-shot and few-shot in-context learning for subjective tasks like emotion recognition and hate speech detection.

Core Problem

Standard LLMs are trained to be universal and objective, often failing to capture the highly subjective nature of tasks where 'correct' labels depend on individual user biases and preferences.

Why it matters:

Subjective tasks (emotion, hate speech, humor) rely heavily on individual interpretation, meaning a 'one-size-fits-all' model often misclassifies valid personal perspectives
Prompt-based methods (zero-shot and few-shot) do not update model weights to reflect user history, leading to inconsistent or generic responses that ignore individual context
Personalization is critical for user satisfaction in recommendation systems and chatbots, yet LLM personalization for subjective text perception remains under-explored

Concrete Example: In hate speech detection, one user might find a comment 'antagonistic' while another finds it 'healthy'. A standard LLM predicts a single label based on general consensus, ignoring the specific user's sensitivity or history, leading to a prediction that matches neither user's view.

Key Novelty

User-ID-Based Personalized Fine-Tuning (CLS-P / LM-P)

Incorporates a unique User ID token directly into the prompt during fine-tuning, allowing the model to learn user-specific embeddings or biases alongside text features
Compares two distinct architectures for personalization: adding a classification head (CLS-P) versus treating the problem as text generation (LM-P), identifying which works best for different label complexities

Architecture

Comparison of three workflows: Non-personalized Query (Zero-shot), Personalized Classification (CLS-P), and Personalized Generation (LM-P).

Evaluation Highlights

+164.2% performance gain (relative) on Unhealthy Conversations using Mistral 7B with personalized fine-tuning (CLS-P) compared to non-personalized baseline
+64.1% performance gain (relative) on GoEmotions using Mistral 7B with personalized fine-tuning (CLS-P) compared to non-personalized baseline
Fine-tuned personalization (CLS-P) consistently outperforms few-shot in-context learning (Q-2S), even when using powerful models like GPT-4

Breakthrough Assessment

7/10

Provides strong empirical evidence that simple ID-based fine-tuning is highly effective for subjective tasks and beats strong few-shot baselines. The method is simple, yet the reported gains are large: up to +29.73 F1-macro points over the non-personalized baseline on Unhealthy Conversations.

⚙️ Technical Details

Problem Definition

Setting: Multi-label classification of subjective text where the ground truth depends on the specific annotator (user)

Inputs: Text input T and user context C_u (specifically the User ID)

Outputs: Predicted subjective label(s) Ŷ_u matching user u's perception
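The setting can be pictured as a collection of per-user annotations in which the same text carries different labels depending on who annotated it. The sketch below is illustrative only: the `Annotation` class, the `labels_by_user` helper, and the sample text are hypothetical and do not come from the paper; the two labels echo the hate speech example given earlier.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Annotation:
    text: str                # input text T
    user_id: str             # user context C_u (here just an ID)
    labels: FrozenSet[str]   # labels this user assigned, Y_u


def labels_by_user(annotations):
    """Group annotations by text so per-user disagreement is visible."""
    grouped = defaultdict(dict)
    for a in annotations:
        grouped[a.text][a.user_id] = a.labels
    return dict(grouped)


# Hypothetical data: one comment, two annotators, two different "correct" labels.
comment = "You clearly did not read the article."
data = [
    Annotation(comment, "u1", frozenset({"antagonistic"})),
    Annotation(comment, "u2", frozenset({"healthy"})),
]
grouped = labels_by_user(data)
print(grouped[comment])
```

A non-personalized model sees only `text` and must emit one answer for both users; a personalized model also receives `user_id` and can return a different answer for each.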

Pipeline Flow

Input Construction: Combine Text + User ID (for personalized methods) or Text + Examples (for few-shot)
Model Processing: Pass through Pre-trained LLM (frozen or fine-tuned)
Output Generation: Either Classification Head (logits) or Language Modeling Head (text generation)

System Modules

Input Construction

Formats the prompt with User ID (e.g., '### User ID: <id>') or few-shot examples

Model or implementation: Rule-based formatting
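A minimal sketch of rule-based prompt construction follows. Only the '### User ID: <id>' line is taken from the summary; the other field names ('### Text:', '### Labels:'), the function names, and the assumption that few-shot examples are the same user's past annotations are hypothetical choices made for illustration.

```python
def build_personalized_prompt(text, user_id):
    """Prompt for CLS-P / LM-P: the User ID is prepended to the text."""
    return f"### User ID: {user_id}\n### Text: {text}\n### Labels:"


def build_few_shot_prompt(text, examples):
    """Prompt for few-shot querying.

    examples: list of (example_text, label_list) pairs, assumed here to be
    earlier annotations by the same user.
    """
    shots = [
        f"### Text: {ex_text}\n### Labels: {', '.join(ex_labels)}"
        for ex_text, ex_labels in examples
    ]
    query = f"### Text: {text}\n### Labels:"
    return "\n\n".join(shots + [query])


personalized = build_personalized_prompt("What a day.", user_id=42)
few_shot = build_few_shot_prompt(
    "What a day.",
    examples=[("I love this!", ["joy"]), ("Leave me alone.", ["anger"])],
)
print(personalized)
print(few_shot)
```

The personalized prompt stays short regardless of how much history a user has, because that history is absorbed into the weights during fine-tuning; the few-shot prompt grows with every example added.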

LLM Backbone

Extracts features from text/user context

Model or implementation: Mistral 7B / Flan-T5-XL / Phi-2 / StableLM 3B (with LoRA adapters)

Head Layer

Predicts final labels

Model or implementation: Classification Head (newly initialized) OR LM Head
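Both heads ultimately have to yield a multi-hot label vector. The sketch below shows one plausible way to decode each output type; the per-label sigmoid with a 0.5 threshold, the comma-separated generation format, and the three-label subset are assumptions for illustration, not details confirmed by the summary.

```python
import math

LABELS = ["anger", "joy", "sadness"]  # illustrative subset of emotion labels


def decode_cls(logits, threshold=0.5):
    """Classification head: one logit per label -> sigmoid -> threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [int(p >= threshold) for p in probs]


def decode_lm(generated, labels=LABELS):
    """LM head: parse generated text such as 'anger, joy' into a multi-hot vector.

    Generated names that are not in the label set are ignored.
    """
    predicted = {part.strip().lower() for part in generated.split(",")}
    return [int(name in predicted) for name in labels]


cls_vector = decode_cls([2.0, -1.0, -3.0])
lm_vector = decode_lm("anger, joy")
print(cls_vector, lm_vector)
```

The classification head always produces a well-formed vector, whereas the generative head needs a parsing step that must tolerate malformed or out-of-vocabulary output.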

Novel Architectural Elements

Direct integration of User ID tokens into instruction-tuning prompts (CLS-P/LM-P) as the primary mechanism for personalization, compared against context-based prompting
Parallel evaluation of classification-head fine-tuning (CLS-P) vs. generative fine-tuning (LM-P) specifically for subjective personalization

Modeling

Base Model: Mistral 7B, Flan-T5-XL (3B), Phi-2 (2.7B), StableLM 3B

Training Method: Supervised Fine-Tuning (SFT) using qLoRA

Objective Functions:

Purpose: Minimize difference between predicted class probabilities and user's specific labels.

Formally: Cross-entropy loss (for CLS/CLS-P) or Causal Language Modeling loss (for LM/LM-P)
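Written out, the two objectives could take the following form. This is a sketch under stated assumptions: the summary only says "cross-entropy", so the per-label binary form (independent sigmoid outputs over K labels) is an assumption suited to the multi-label setting, and the symbols K, N, w_t, and θ are introduced here for illustration.

```latex
% Classification head (CLS / CLS-P), assuming independent sigmoid outputs per label
\mathcal{L}_{\mathrm{CLS}}
  = -\sum_{c=1}^{K} \Big[ y_{u,c} \log \hat{y}_{u,c}
      + (1 - y_{u,c}) \log\big(1 - \hat{y}_{u,c}\big) \Big]

% Language-modeling head (LM / LM-P): next-token loss over the label tokens w_1, ..., w_N
\mathcal{L}_{\mathrm{LM}}
  = -\sum_{t=1}^{N} \log p_{\theta}\big(w_t \mid w_{<t},\, T,\, C_u\big)
```

In the non-personalized variants (CLS, LM) the conditioning on C_u is simply absent, so the model can only fit a single label distribution per text.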

Adaptation: qLoRA (4-bit quantization, LoRA adapters on linear layers)

Trainable Parameters: LoRA adapters + Head layer (Classification or LM head)
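To see why training only the adapters is cheap, the sketch below counts the parameters of a LoRA update for a single linear layer. LoRA replaces the full update of a d_out x d_in weight matrix with two low-rank factors, B (d_out x r) and A (r x d_in). The layer width 4096 and rank 16 are illustrative values, not hyperparameters reported in the summary.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in, d_out):
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out


d = 4096   # illustrative layer width
r = 16     # illustrative LoRA rank
lora = lora_trainable_params(d, d, r)
full = full_params(d, d)
fraction = lora / full
print(lora, full, fraction)  # 131072 16777216 0.0078125
```

Under these illustrative numbers the adapter holds under 1% of the layer's parameters, which is what makes fine-tuning feasible on consumer GPUs.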

Training Data:

GoEmotions: ~146k train samples, ~18k val/test. 72 annotators.
Unhealthy Conversations: ~168k train samples, ~20k val/test. 427 annotators.

Key Hyperparameters:

quantization: 4-bit NormalFloat (NF4)
precision: fp16 (Mistral, StableLM) / bf16 (Flan-T5, Phi-2)
LoRA_target: All linear layers except last

Compute: Four NVIDIA GeForce RTX 3090 GPUs (24 GB VRAM each)
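A rough back-of-the-envelope estimate shows why 4-bit quantization matters on 24 GB cards. The sketch counts weight storage only, using a nominal 7 billion parameters; it ignores quantization constants, LoRA adapters, activations, and optimizer state, so real usage is higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # nominal size of a "7B" model
fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4)
print(fp16_gb, nf4_gb)  # 14.0 3.5
```

At 16 bits the weights alone take more than half of a 24 GB card; at 4 bits they take about 3.5 GB, leaving room for activations and adapter training.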

Comparison to Prior Work

vs. In-Context Learning: Updates model weights to permanently store user preferences rather than relying on context window
vs. Non-personalized Fine-Tuning: Explicitly conditions the model on User ID, enabling distinct predictions for the same text based on who is asking
vs. Personalized Embeddings [not cited in paper]: Uses simple User ID tokens in text prompt rather than learning separate dense user embedding vectors to concatenate with inputs

Limitations

Requires User ID to be present during training (cold start problem for new users not addressed)
Experiments limited to two English datasets (GoEmotions, Unhealthy Conversations)
Requires retraining/fine-tuning to add new users (unlike in-context learning)
Performance gain varies significantly by dataset complexity (number of labels)

Reproducibility

Code: https://github.com/Rikain/llm-finetuning

Code and datasets are publicly available in the repository linked above. The models used are open-weight (Mistral, Flan-T5, etc.). Training hyperparameters (precision, quantization) are specified.

📊 Experiments & Results

Evaluation Setup

Subjective multi-label classification on datasets with multiple annotators per text

Benchmarks:

GoEmotions: emotion recognition (28 classes)
Unhealthy Conversations: hate speech / toxic attribute detection (7 classes)

Metrics:

F1-macro score
Statistical methodology: Not explicitly reported in the paper
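For reference, F1-macro on multi-hot predictions can be computed as in the sketch below. It is a plain-Python illustration rather than the paper's evaluation code; the function name and the convention of scoring a class with no true or predicted positives as 0.0 are assumptions.

```python
def f1_macro(y_true, y_pred):
    """Macro-averaged F1 over classes.

    y_true, y_pred: lists of multi-hot label vectors, one vector per sample.
    """
    n_classes = len(y_true[0])
    scores = []
    for c in range(n_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in pairs if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in pairs if t[c] == 1 and p[c] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no positives at all scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 0]]
score = f1_macro(y_true, y_pred)
print(round(score, 4))  # 0.6667
```

Because every class contributes equally to the average, rare labels weigh as much as frequent ones, which is relevant for a 28-class dataset with skewed label frequencies.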

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Personalized fine-tuning (CLS-P/LM-P) vs Non-personalized baselines (CLS/LM). Shows massive gains from adding User ID.
Unhealthy Conversations	F1-macro	23.10	52.83	+29.73
GoEmotions	F1-macro	26.77	43.94	+17.17
Comparison of Fine-Tuning vs. In-Context Learning (Few-Shot) with GPT-4.
GoEmotions	F1-macro	26.64	34.52	+7.88
Unhealthy Conversations	F1-macro	30.57	51.65	+21.08
Comparison of Architecture: Decoder-Only (Mistral, Baseline column) vs Encoder-Decoder (Flan-T5, This Paper column).
GoEmotions	F1-macro	43.94	45.68	+1.74
Unhealthy Conversations	F1-macro	52.83	59.42	+6.59
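The Δ column is an absolute difference in F1-macro points, while the highlights quote relative gains. A minimal sketch of both computations, applied to the first two rows of the table (the helper name `gains` is illustrative):

```python
def gains(baseline, score):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = score - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative_pct, 1)


goemotions = gains(26.77, 43.94)
unhealthy = gains(23.10, 52.83)
print(goemotions)  # (17.17, 64.1)
print(unhealthy)   # (29.73, 128.7)
```

The GoEmotions row reproduces the +64.1% relative gain quoted in the highlights. The Unhealthy Conversations row gives a relative gain of about 128.7%, which is lower than the +164.2% quoted there; the summary does not state which baseline that figure uses.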

Experiment Figures

Bar chart of performance gains (%) on GoEmotions dataset for varying models.

Bar chart of performance gains (%) on Unhealthy Conversations dataset.

Main Takeaways

Personalized fine-tuning (adding User ID) yields consistent and large performance gains over non-personalized baselines across all models and datasets.
Fine-tuning (adjusting weights) is far more effective for personalization than In-Context Learning (prompting), even when comparing smaller fine-tuned models (Mistral 7B) against larger prompting models (GPT-4).
Encoder-Decoder architectures (Flan-T5) can outperform larger Decoder-only models (Mistral) on discriminative classification tasks (CLS settings).
The choice between Classification (CLS-P) and Generative (LM-P) personalization depends on the dataset: CLS-P was better for GoEmotions (many labels), while LM-P was competitive for Unhealthy Conversations (fewer labels).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and fine-tuning
Familiarity with In-Context Learning (few-shot prompting)
Basic knowledge of LoRA (Low-Rank Adaptation) for efficient training

Key Terms

CLS-P: Personalized Classification—Fine-tuning an LLM with a classification head and User ID input to predict class labels

LM-P: Personalized Language Modeling—Fine-tuning an LLM to generate text labels (e.g., 'anger, joy') given a User ID and input text

qLoRA: Quantized Low-Rank Adaptation—A memory-efficient fine-tuning technique that uses quantized weights (e.g., 4-bit) and low-rank adapters

In-Context Learning: A technique where the model is given examples (shots) in the prompt to understand the task without updating its weights

Subjective tasks: NLP tasks where the 'correct' answer varies between people, such as emotion recognition or hate speech detection

F1-macro: An evaluation metric that calculates the F1 score (harmonic mean of precision and recall) for each class and then averages them, treating all classes equally

Encoder-Decoder: A model architecture (like T5) that processes input text into a representation (encoding) and then generates output text (decoding)

Decoder-only: A model architecture (like GPT or Mistral) that predicts the next token in a sequence, used for both understanding and generation