Exploring Safety-Utility Trade-Offs in Personalized Language Models

📝 Paper Summary

Bias and Fairness in LLMs Evaluation of Personalized LLMs

Personalizing LLMs by explicitly stating user attributes causes significant, uneven fluctuations in model utility and safety—defined as personalization bias—which often worsens after instruction tuning.

Core Problem

When LLMs are personalized to a user's specific demographic identity (e.g., 'I am a senior citizen'), they often exhibit erratic shifts in performance, compromising either safety or utility compared to neutral baselines.

Why it matters:

Existing bias research focuses on 'subject bias' (bias against a group) or 'persona bias' (bias when acting as a group), overlooking 'personalization bias' (bias when talking TO a group)
Users increasingly customize LLMs via system prompts; if models sandbag or refuse benign queries based on identity, fairness and usability are compromised
There is often an unacknowledged trade-off: increasing safety for certain identities (e.g., minors) might unintentionally degrade general reasoning utility

Concrete Example: When a user identifies as 'a senior citizen' in the system prompt, an LLM might respond to a math question with patronizing text like 'Let's see, my dear... My, my, that's a lot of roots' instead of a direct answer, or refuse to answer entirely, unlike its behavior with a neutral user.

Key Novelty

Quantifying Personalization Bias (PB) via Safety-Utility Trade-offs

Introduces 'Personalization Bias' as a distinct failure mode where model performance ($f(u)$) varies strictly based on the user's revealed identity ($u$)
Proposes a dual-axis evaluation framework measuring 'Utility' (reasoning/knowledge capabilities) versus 'Safety' (refusal of harmful prompts) to identify trade-offs
Defines a scalar PB score to quantify the variance of a model's performance across a set of demographic identities relative to the mean performance

Evaluation Highlights

Open-source LLMs exhibit PB scores (variance metric) ranging from 1.63 to 4.76, indicating significant instability across user identities
Instruction tuning exacerbates personalization bias: Llama-3.1 8B's utility PB score increases from 1.13 (pre-trained) to 1.25 (instruction-tuned)
Mistral 7B shows a sharp increase in personalization bias after instruction tuning, with the utility PB score rising from 1.54 to 2.21

Breakthrough Assessment

7/10

Important conceptual distinction (Personalization Bias vs. Persona Bias) and a rigorous evaluation framework. While it doesn't propose a new model, the analysis of training stages and trade-offs is valuable for the fairness community.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of personalized LLM performance variability across a set of user identities $\mathcal{U}$

Inputs: A user identity $u \in \mathcal{U}$ embedded in a system prompt, and a task query $q$

Outputs: The model's response $r$, evaluated for utility (correctness) or safety (benign refusal)

Pipeline Flow

Input Construction (Embed Identity into System Prompt)
Model Inference (Generate Response)
Evaluation (Measure Safety & Utility)

System Modules

Input Construction

Inject user identity into the system prompt using a template that minimizes leakage

Model or implementation: Prompt Template [P6] (Selected via ablation)

Model Inference

Generate response based on personalized context

Model or implementation: Various (Llama-3.1, GPT-4o, etc.)

Evaluator

Assess response for Utility (Accuracy) or Safety (Refusal)

Model or implementation: Deterministic metrics (Accuracy) or Classifier-based safety checks

Novel Architectural Elements

Prompt selection framework designed specifically to maximize 'Identity Imprinting' (model knows who user is) while minimizing 'Identity Leakage' (model doesn't pretend to be the user)

Modeling

Base Model: Evaluated on: Llama-2 (13B/70B), Llama-3.1 (8B/70B), Mistral-7B, Mixtral 8x7B, GPT-3.5 (gpt-3.5-turbo-0125), GPT-4o (gpt-4o-2024-05-13)

Training Method: Analysis of existing checkpoints (Pretrained, Instruct, Preference-tuned)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Gupta et al. (2023): Examines bias when the *user* has the identity, not the model. Measures trade-off between Utility and Safety, not just one.
vs. Li et al. (2024b) [concurrent]: Li et al. focus on refusal rates and political bias; this work quantifies bias across broad utility tasks (Math, Code) and Safety simultaneously.
vs. He et al. (2024) [contemporary]: He et al. propose decoding strategies to mitigate; this paper focuses on quantification and the impact of training stages.

Limitations

Relies on self-reported safety/refusals which can be brittle
Focuses on explicit identity in system prompts, not implicit personalization from interaction history
Analysis of training stages is limited to available checkpoints, not controlled retraining from scratch
Safety datasets (DNA, StrongReject) may have overlap with model training data (contamination)

Reproducibility

Code: https://github.com/brcsomnath/personalization-bias

Code and data (prompts) available at https://github.com/brcsomnath/personalization-bias. The paper lists all 31 user identities and the specific prompt templates used. Experiments rely on public checkpoints (HuggingFace) and proprietary APIs (OpenAI).

📊 Experiments & Results

Evaluation Setup

Zero-shot prompting with explicit user identity in system prompt.

Benchmarks:

MMLU (General Knowledge & Reasoning)
GSM8K (Grade-school Math)
MBPP (Python Programming)
Do-Not-Answer (DNA) (Safety/Refusal)
StrongReject (Safety/Refusal)

Metrics:

Accuracy (Utility)
Safety Score (Refusal Rate)
PB Score (Personalization Bias - Variance from mean)
Statistical methodology: Average performance across 3 runs reported for open-source models.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Analysis of how Personalization Bias (PB) scores change across training stages (Pre-training to Instruction Tuning).
MMLU	PB Score (Utility)	1.13	1.25	+0.12
MMLU	PB Score (Utility)	1.54	2.21	+0.67
Safety-Utility Trade-off observations showing that user identity impacts model behavior.
Aggregate (Open Source LLMs)	PB Score Range	0	1.63	+1.63

Experiment Figures

Scatter plots of Safety (y-axis) vs. Utility (x-axis) for various LLMs across 31 user identities.

Bar charts showing Utility PB Scores across training stages (Pre-train, Instruct, Preference).

Main Takeaways

Providing any user identity generally increases safety compared to 'no identity', but creates significant variance in utility (Personalization Bias).
Instruction tuning appears to be a major contributor to personalization bias, often increasing the PB score compared to pre-trained models.
Certain identities trigger consistent behavior across models: 'minor' identity improves safety, while 'non-binary' often reduces safety scores.
Intersectional identities (e.g., 'Jewish African lesbian') show safety scores roughly averaging the components, but can yield distinct trade-offs.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM prompting (System vs. User prompts)
Familiarity with standard LLM benchmarks (MMLU, GSM8K)
Basic concepts of AI Safety (refusals, jailbreaks)

Key Terms

Personalization Bias: Bias exhibited when an LLM's performance (safety or utility) fluctuates based on the explicit identity of the user it is interacting with

Persona Bias: Bias exhibited when an LLM is asked to adopt a specific persona (e.g., 'Talk like a Muslim')

Subject Bias: Bias exhibited when an LLM generates content about a specific demographic group

PB Score: A metric quantifying the variance in model performance across different user identities; lower is better

System Prompt: A high-level instruction given to the LLM to define its behavior or context (e.g., 'You are a helpful assistant talking to a [identity]')

Utility: The model's ability to perform reasoning and knowledge tasks correctly (measured via MMLU, GSM8K, MBPP)

Safety: The model's ability to refuse harmful instructions or provide benign responses to unsafe prompts (measured via Do-Not-Answer, StrongReject)

Identity Leakage: When an LLM mistakenly adopts the user's identity as its own persona (e.g., responding 'As a disabled person, I...' when the user is the one who is disabled)

Instruction Tuning: A training phase where the model is fine-tuned on dataset of (instruction, output) pairs to follow commands

Preference Tuning: A training phase (like RLHF or DPO) where the model is aligned with human preferences, often to improve safety