Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation

📝 Paper Summary

LLM for Recommendation (RecLLM)
Fairness in LLMs

FaiRLLM is a benchmark that evaluates whether LLMs like ChatGPT treat users differently based on sensitive attributes by comparing recommendations given to sensitive versus neutral user profiles.

Core Problem

Large Language Models used for recommendation (RecLLM) may inherit social biases from pre-training data, leading to unfair treatment of users with certain sensitive attributes (e.g., race, gender).

Why it matters:

Vulnerable groups may receive systematically different or lower-quality recommendations if the model infers or is told their sensitive attributes
Traditional recommendation fairness metrics rely on prediction scores and fixed candidate sets, which are incompatible with the generative nature of LLMs
Users may choose to hide sensitive attributes for privacy, and systems should not penalize or favor them based on this disclosure or lack thereof

Concrete Example: When a user asks for Adele songs with the instruction 'I am a white fan of Adele', ChatGPT provides a standard list (e.g., 'Someone Like You'). When the instruction changes to 'I am an African American fan', the list shifts dramatically to different songs or genres, revealing an implicit bias in how the model perceives user preferences based on race.

Key Novelty

FaiRLLM Benchmark (Fairness of Recommendation via LLM)

Evaluates fairness by measuring how much the similarity between 'sensitive' recommendations (where a user attribute is explicit) and 'neutral' recommendations (where it is absent) diverges across the values of that attribute
Introduces two new fairness metrics (SNSR and SNSV) that quantify how much the model's output varies across different groups compared to a neutral baseline
Constructs a dataset covering 8 sensitive attributes (e.g., race, religion, continent) across music and movie domains to probe generative recommenders systematically

Architecture

Figure: Conceptual workflow of the FaiRLLM evaluation benchmark

Evaluation Highlights

ChatGPT exhibits significant unfairness on the 'Race' attribute in movie recommendations, with a Sensitive-to-Neutral Similarity Variance (SNSV) of 0.0828 (PRAG*@20 metric)
Geography bias is prominent: 'Continent' and 'Country' attributes show high unfairness in music recommendations, with SNSV values of 0.0203 and 0.0141 respectively (Jaccard@20)
Unfairness persists across languages: Chinese prompts show similar patterns of disadvantage for 'African' and 'Asian' groups compared to 'American' groups in the continent attribute

Breakthrough Assessment

8/10

Pioneering work establishing the first benchmark for fairness in generative recommendation (RecLLM). While limited to ChatGPT and two domains, it defines the problem space and metrics for a critical emerging area.

⚙️ Technical Details

Problem Definition

Setting: Generative Top-K Recommendation given natural language user instructions

Inputs: User instruction prompt I containing explicit preferences (e.g., 'fan of Adele') and optionally a sensitive attribute value a (e.g., 'African American')

Outputs: Ordered list of K recommended items (titles)

Pipeline Flow

Generate Neutral Instruction (no sensitive attribute)
Generate Set of Sensitive Instructions (injecting specific attribute values)
Get LLM Recommendations for all instructions
Compute Similarity (Neutral vs. Sensitive lists)
Calculate Fairness Metrics (SNSR, SNSV) based on similarity divergence
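
The five steps above can be sketched end to end for a single user and a single sensitive attribute. This is a minimal illustration, not the authors' code: `recommend` is a hypothetical callable standing in for the LLM call, the prompts are passed in ready-made, Jaccard is used as the similarity, and the toy item IDs are invented. The benchmark itself aggregates similarities over many user instructions before computing SNSR and SNSV.

```python
from statistics import pstdev


def jaccard(list_a, list_b):
    """Set overlap: |intersection| / |union| (1.0 if both lists are empty)."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def evaluate_attribute(recommend, neutral_prompt, sensitive_prompts, k=20):
    """Compare each sensitive list against the neutral list for one user.

    recommend: hypothetical callable, prompt -> ordered list of item titles.
    sensitive_prompts: dict mapping attribute value -> prompt text.
    Returns (per-value similarity, range, standard deviation).
    """
    neutral = recommend(neutral_prompt)[:k]
    sims = {
        value: jaccard(recommend(prompt)[:k], neutral)
        for value, prompt in sensitive_prompts.items()
    }
    scores = list(sims.values())
    snsr = max(scores) - min(scores)  # gap between best and worst group
    snsv = pstdev(scores)             # population standard deviation
    return sims, snsr, snsv


# Toy stand-in for the LLM: canned lists keyed by prompt (invented data).
canned = {
    "neutral": ["s1", "s2", "s3", "s4"],
    "group A": ["s1", "s2", "s3", "s5"],
    "group B": ["s1", "s6", "s7", "s8"],
}
sims, snsr, snsv = evaluate_attribute(
    canned.get, "neutral", {"A": "group A", "B": "group B"}, k=4
)
```

A value of 0 for both `snsr` and `snsv` would mean every group's list is equally similar to the neutral list; larger values indicate that some groups are treated more differently than others.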

System Modules

Instruction Generator

Constructs prompts using templates: 'I am a [sensitive_feature] fan of [name]. Please provide...'

Model or implementation: Template-based injection
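
A minimal sketch of template-based injection, assuming the template quoted above. The function name `build_instruction`, the `request` parameter, and the vowel-based choice between "a" and "an" are illustrative assumptions; the request text after "Please provide..." is elided in this summary and is left as a placeholder rather than guessed.

```python
def build_instruction(name, sensitive_feature=None, request="Please provide..."):
    """Build a neutral or sensitive instruction from the template.

    name: the artist/director the user is a fan of (e.g. 'Adele').
    sensitive_feature: attribute value to inject, or None for the neutral prompt.
    request: placeholder for the elided remainder of the template (assumption).
    """
    if sensitive_feature:
        # Simplified article choice; real English needs more than a vowel check.
        article = "an" if sensitive_feature[0].lower() in "aeiou" else "a"
        subject = f"{article} {sensitive_feature} fan"
    else:
        subject = "a fan"
    return f"I am {subject} of {name}. {request}"


neutral_prompt = build_instruction("Adele")
sensitive_prompts = {
    value: build_instruction("Adele", value)
    for value in ["white", "African American"]
}
```

Keeping the neutral and sensitive prompts identical except for the injected attribute is what lets any difference in the returned lists be attributed to that attribute.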

LLM Recommender

Generates item lists based on prompts

Model or implementation: ChatGPT (OpenAI)

Fairness Evaluator

Computes similarity metrics between sensitive and neutral outputs to quantify bias

Model or implementation: N/A (Analytical script)

Novel Architectural Elements

Evaluation framework specifically designed for generative recommendation where the item set is open (not fixed candidates), using similarity-to-neutral as a proxy for fairness

Modeling

Base Model: ChatGPT (version used in May 2023 experiments)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Traditional Group Fairness: FaiRLLM uses 'similarity to neutral' instead of utility/accuracy metrics because ground truth for open-ended generation is hard to define and RecLLM has no fixed candidate set
vs. CrowS-Pairs [not cited in paper]: CrowS-Pairs evaluates bias in masked language models via likelihood of stereotypical sentences; FaiRLLM evaluates bias in generative recommendation via list similarity

Limitations

Evaluation is limited to user-side fairness (sensitive attributes of the user), ignoring item-side fairness
Relies on 'neutral' instruction as the ground truth reference, which itself might be biased
Experiments limited to ChatGPT; other LLMs (Llama, etc.) mentioned as future work but not evaluated here
Analysis is observational; does not propose mitigation strategies

Reproducibility

Code: https://github.com/jizhi-zhang/FaiRLLM

Available: Code and datasets (Music, Movie) are publicly available at https://github.com/jizhi-zhang/FaiRLLM.

Missing: Exact version/checkpoint of ChatGPT used (beyond generic name), though hyperparameters are specified (temperature=0).

📊 Experiments & Results

Evaluation Setup

Top-K (K=20) Music and Movie recommendation

Benchmarks:

FaiRLLM - Music (Generative Music Recommendation) [New]
FaiRLLM - Movie (Generative Movie Recommendation) [New]

Metrics:

SNSR (Sensitive-to-Neutral Similarity Range)
SNSV (Sensitive-to-Neutral Similarity Variance)
Jaccard Similarity
SERP* Similarity
PRAG* Similarity

Statistical methodology: Not explicitly reported in the paper

Key Results

Music recommendation fairness results showing ChatGPT's bias across sensitive attributes.

Benchmark	Metric	Baseline	This Paper	Δ
FaiRLLM - Music	SNSV (Jaccard@20)	0	0.0248	+0.0248
FaiRLLM - Music	SNSV (Jaccard@20)	0	0.0203	+0.0203
FaiRLLM - Music	SNSR (Jaccard@20)	0	0.0554	+0.0554

Movie recommendation fairness results showing significantly higher bias in Race and Country attributes compared to Music.

Benchmark	Metric	Baseline	This Paper	Δ
FaiRLLM - Movie	SNSV (PRAG*@20)	0	0.0828	+0.0828
FaiRLLM - Movie	SNSR (PRAG*@20)	0	0.2191	+0.2191
FaiRLLM - Movie	SNSV (Jaccard@20)	0	0.0619	+0.0619

Main Takeaways

ChatGPT displays clear unfairness across most sensitive attributes, with 'Race' and 'Geography' (Country/Continent) being the most severe sources of bias
Unfairness is robust to minor typos (e.g., 'Afrian' vs 'African') and persists across languages (Chinese vs English)
The degree of unfairness varies significantly by domain: Movie recommendations showed much higher racial bias (SNSV ~0.08) compared to Music (SNSV ~0.006)
Disadvantaged groups in the model (e.g., African, Black) align with real-world marginalized groups, suggesting the model reinforces existing social prejudices

📚 Prerequisite Knowledge

Prerequisites

Basic concepts of Fairness in Machine Learning (Individual vs. Group Fairness)
Generative Recommendation paradigms
Set similarity metrics (Jaccard)

Key Terms

RecLLM: Recommendation via Large Language Model—a paradigm where LLMs generate recommendations directly from user instructions

FaiRLLM: Fairness of Recommendation via LLM—the benchmark proposed in this paper

Sensitive Attribute: Personal user characteristics like race, gender, or religion that should not unjustly influence recommendation outcomes

SNSR: Sensitive-to-Neutral Similarity Range—a metric measuring the gap between the most advantaged and most disadvantaged groups' similarity to the neutral baseline

SNSV: Sensitive-to-Neutral Similarity Variance, a metric measuring the standard deviation (despite 'Variance' in the name) of similarities to the neutral baseline across all sensitive groups
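
The two definitions above can be written out as follows. The notation is an assumption made for illustration: $A$ is the set of values of one sensitive attribute, $M$ the number of user instructions, $R_m$ the neutral top-$K$ list for instruction $m$, $R_m^{a}$ the list obtained when value $a$ is injected, and $\mathrm{Sim}$ any of the list similarities (Jaccard, SERP*, PRAG*). The paper's exact symbols may differ.

```latex
\begin{align}
\overline{\mathrm{Sim}}(a) &= \frac{1}{M}\sum_{m=1}^{M}\mathrm{Sim}\!\left(R_m^{a},\, R_m\right) \\
\mathrm{SNSR@}K &= \max_{a \in A}\overline{\mathrm{Sim}}(a) \;-\; \min_{a \in A}\overline{\mathrm{Sim}}(a) \\
\mathrm{SNSV@}K &= \sqrt{\frac{1}{|A|}\sum_{a \in A}\left(\overline{\mathrm{Sim}}(a) - \frac{1}{|A|}\sum_{a' \in A}\overline{\mathrm{Sim}}(a')\right)^{2}}
\end{align}
```

Both quantities are 0 when every group's recommendations are equally similar to the neutral ones. SNSR is driven only by the two extreme groups, while SNSV reflects the spread across all of them.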

Jaccard similarity: A measure of set overlap: size of intersection divided by size of union

SERP*: A rank-sensitive similarity metric adapted from search result auditing; weights overlapping items by their rank
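
One plausible form of such a rank-weighted overlap is sketched below. The linear weight `k - rank + 1` and the normalization by `k * (k + 1) / 2` are assumptions chosen so that identical lists score 1; the paper's exact formula should be checked against the original before relying on this.

```python
def serp_star(sensitive, neutral, k):
    """Rank-weighted overlap of the sensitive list with the neutral list.

    An item of the sensitive list that also appears in the neutral list
    contributes (k - rank + 1), so overlaps near the top count more.
    The result is normalized to [0, 1] by the maximum possible sum.
    """
    neutral_items = set(neutral[:k])
    weighted_hits = sum(
        k - rank + 1
        for rank, item in enumerate(sensitive[:k], start=1)
        if item in neutral_items
    )
    return weighted_hits / (k * (k + 1) / 2)
```

Unlike Jaccard, this score distinguishes a shared item at rank 1 from the same shared item at rank K, which matters when users mostly look at the top of the list.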

PRAG*: Pairwise Ranking Accuracy Gap—a similarity metric that considers relative pairwise ordering of items between two lists
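
A hedged sketch of a pairwise-order similarity in this spirit follows. It counts pairs of items whose relative order in the sensitive list is preserved in the neutral list, treating an item missing from the neutral list as ranked last. Normalizing by the number of pairs is an assumption made so that identical lists score 1; the paper's normalization may differ. Lists are assumed to contain no duplicate items.

```python
import math
from itertools import combinations


def prag_star(sensitive, neutral, k):
    """Fraction of item pairs whose order in the sensitive list
    also holds in the neutral list."""
    sens = sensitive[:k]
    neutral_rank = {item: pos for pos, item in enumerate(neutral[:k])}
    # combinations() preserves list order, so v1 is ranked above v2 in `sens`.
    pairs = list(combinations(sens, 2))
    if not pairs:
        return 0.0
    agreeing = sum(
        1
        for v1, v2 in pairs
        if v1 in neutral_rank
        and neutral_rank[v1] < neutral_rank.get(v2, math.inf)
    )
    return agreeing / len(pairs)
```

Because it looks at relative order rather than absolute position, this score drops to 0 for a fully reversed list even though the set overlap is complete.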