← Back to Paper List

Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation

Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He
University of Science and Technology of China, National University of Singapore
arXiv (2023)
Recommendation Benchmark

📝 Paper Summary

LLM for Recommendation (RecLLM) Fairness in LLMs
FaiRLLM is a benchmark that evaluates whether LLMs like ChatGPT treat users differently based on sensitive attributes by comparing recommendations given to sensitive versus neutral user profiles.
Core Problem
Large Language Models used for recommendation (RecLLM) may inherit social biases from pre-training data, leading to unfair treatment of users with certain sensitive attributes (e.g., race, gender).
Why it matters:
  • Vulnerable groups may receive systematically different or lower-quality recommendations if the model infers or is told their sensitive attributes
  • Traditional recommendation fairness metrics rely on prediction scores and fixed candidate sets, which are incompatible with the generative nature of LLMs
  • Users may choose to hide sensitive attributes for privacy, and systems should not penalize or favor them based on this disclosure or lack thereof
Concrete Example: When a user asks for Adele songs with the instruction 'I am a white fan of Adele', ChatGPT provides a standard list (e.g., 'Someone Like You'). When the instruction changes to 'I am an African American fan', the list shifts dramatically to different songs or genres, revealing an implicit bias in how the model perceives user preferences based on race.
Key Novelty
FaiRLLM Benchmark (Fairness of Recommendation via LLM)
  • Evaluates fairness by measuring the divergence in similarity between 'sensitive' recommendations (where a user attribute is explicit) and 'neutral' recommendations (where it is absent)
  • Introduces two new fairness metrics (SNSR and SNSV) that quantify how much the model's output varies across different groups compared to a neutral baseline
  • Constructs a dataset covering 8 sensitive attributes (e.g., race, religion, continent) across music and movie domains to probe generative recommenders systematically
Architecture
Architecture Figure Figure 1
Conceptual workflow of the FaiRLLM evaluation benchmark
Evaluation Highlights
  • ChatGPT exhibits significant unfairness on the 'Race' attribute in movie recommendations, with a Sensitive-to-Neutral Similarity Variance (SNSV) of 0.0828 (PRAG*@20 metric)
  • Geography bias is prominent: 'Continent' and 'Country' attributes show high unfairness in music recommendations, with SNSV values of 0.0203 and 0.0141 respectively (Jaccard@20)
  • Unfairness persists across languages: Chinese prompts show similar patterns of disadvantage for 'African' and 'Asian' groups compared to 'American' groups in the continent attribute
Breakthrough Assessment
8/10
Pioneering work establishing the first benchmark for fairness in generative recommendation (RecLLM). While limited to ChatGPT and two domains, it defines the problem space and metrics for a critical emerging area.
×