Personalization Bias: Bias exhibited when an LLM's performance (safety or utility) fluctuates based on the explicit identity of the user it is interacting with
Persona Bias: Bias exhibited when an LLM is asked to adopt a specific persona (e.g., 'Talk like a Muslim')
Subject Bias: Bias exhibited when an LLM generates content about a specific demographic group
PB Score: A metric quantifying the variance in model performance across different user identities; lower is better
System Prompt: A high-level instruction given to the LLM to define its behavior or context (e.g., 'You are a helpful assistant talking to a [identity]')
Utility: The model's ability to perform reasoning and knowledge tasks correctly (measured via MMLU, GSM8K, MBPP)
Safety: The model's ability to refuse harmful instructions or provide benign responses to unsafe prompts (measured via Do-Not-Answer, StrongReject)
Identity Leakage: When an LLM mistakenly adopts the user's identity as its own persona (e.g., responding 'As a disabled person, I...' when the user is the one who is disabled)
Instruction Tuning: A training phase where the model is fine-tuned on a dataset of (instruction, output) pairs to follow commands
Preference Tuning: A training phase (such as RLHF or DPO) where the model is aligned with human preferences, often to improve safety
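
Illustration: the System Prompt and PB Score entries can be made concrete with a minimal Python sketch. This is an assumption-laden illustration, not the metric's actual definition: the glossary does not give a formula for the PB Score, so plain population variance stands in for "variance in model performance across different user identities". The function names, identity labels, and accuracy values are hypothetical.

```python
from statistics import pvariance

# Template taken from the System Prompt entry; "{identity}" replaces "[identity]".
SYSTEM_PROMPT_TEMPLATE = "You are a helpful assistant talking to a {identity}"


def build_system_prompt(identity: str) -> str:
    """Fill the user identity into the system prompt template."""
    return SYSTEM_PROMPT_TEMPLATE.format(identity=identity)


def dispersion_score(scores_by_identity: dict[str, float]) -> float:
    """Population variance of per-identity scores (lower means more uniform).

    ASSUMPTION: a stand-in for the PB Score, whose exact formula is not
    stated in this glossary.
    """
    if len(scores_by_identity) < 2:
        raise ValueError("need scores for at least two identities")
    return pvariance(list(scores_by_identity.values()))


# Hypothetical per-identity accuracies on a utility benchmark (made-up values).
example_scores = {"identity_a": 0.70, "identity_b": 0.60, "identity_c": 0.80}

if __name__ == "__main__":
    for identity in example_scores:
        print(build_system_prompt(identity))
    print(dispersion_score(example_scores))
```

A model whose score is identical for every identity gets a dispersion of 0 under this stand-in, which matches the "lower is better" reading of the PB Score entry.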