← Back to Paper List

Exploring Safety-Utility Trade-Offs in Personalized Language Models

Anvesh Rao Vijjini, Somnath Basu Roy Chowdhury, Snigdha Chaturvedi
University of North Carolina at Chapel Hill
North American Chapter of the Association for Computational Linguistics (2024)
P13N Benchmark Reasoning

📝 Paper Summary

Bias and Fairness in LLMs Evaluation of Personalized LLMs
Personalizing LLMs by explicitly stating user attributes causes significant, uneven fluctuations in model utility and safety—defined as personalization bias—which often worsens after instruction tuning.
Core Problem
When LLMs are personalized to a user's specific demographic identity (e.g., 'I am a senior citizen'), they often exhibit erratic shifts in performance, compromising either safety or utility compared to neutral baselines.
Why it matters:
  • Existing bias research focuses on 'subject bias' (bias against a group) or 'persona bias' (bias when acting as a group), overlooking 'personalization bias' (bias when talking TO a group)
  • Users increasingly customize LLMs via system prompts; if models sandbag or refuse benign queries based on identity, fairness and usability are compromised
  • There is often an unacknowledged trade-off: increasing safety for certain identities (e.g., minors) might unintentionally degrade general reasoning utility
Concrete Example: When a user identifies as 'a senior citizen' in the system prompt, an LLM might respond to a math question with patronizing text like 'Let's see, my dear... My, my, that's a lot of roots' instead of a direct answer, or refuse to answer entirely, unlike its behavior with a neutral user.
Key Novelty
Quantifying Personalization Bias (PB) via Safety-Utility Trade-offs
  • Introduces 'Personalization Bias' as a distinct failure mode where model performance ($f(u)$) varies strictly based on the user's revealed identity ($u$)
  • Proposes a dual-axis evaluation framework measuring 'Utility' (reasoning/knowledge capabilities) versus 'Safety' (refusal of harmful prompts) to identify trade-offs
  • Defines a scalar PB score to quantify the variance of a model's performance across a set of demographic identities relative to the mean performance
Evaluation Highlights
  • Open-source LLMs exhibit PB scores (variance metric) ranging from 1.63 to 4.76, indicating significant instability across user identities
  • Instruction tuning exacerbates personalization bias: Llama-3.1 8B's utility PB score increases from 1.13 (pre-trained) to 1.25 (instruction-tuned)
  • Mistral 7B shows a sharp increase in personalization bias after instruction tuning, with the utility PB score rising from 1.54 to 2.21
Breakthrough Assessment
7/10
Important conceptual distinction (Personalization Bias vs. Persona Bias) and a rigorous evaluation framework. While it doesn't propose a new model, the analysis of training stages and trade-offs is valuable for the fairness community.
×