FairEval: Evaluating Fairness in LLM-Based Recommendations with Personality Awareness

📝 Paper Summary

LLM-based Recommender Systems (RecLLMs) Fairness and Bias Evaluation

FairEval is an evaluation framework that assesses fairness in LLM recommenders by integrating personality traits with demographic attributes and testing robustness against prompt variations.

Core Problem

LLM-based recommenders (RecLLMs) exhibit implicit biases based on user demographics and are highly sensitive to prompt phrasing, yet existing benchmarks overlook personality-driven unfairness and prompt robustness.

Why it matters:

RecLLMs are replacing traditional recommender systems, so their unmeasured biases risk creating unequal opportunity and unequal access to information
Personality traits influence recommendations, meaning users may be unfairly stereotyped based on psychological profiles, not just protected attributes
Existing frameworks assume static prompts, failing to catch biases that appear only with specific phrasing, typos, or languages

Concrete Example: A user who states no demographic details and asks for sci-fi movies gets 'Blade Runner 2049', but a user identifying as a 'Middle Eastern female professor' making the same request gets 'Lawrence of Arabia', showing how the model overrides an explicit genre preference with a cultural stereotype.

Key Novelty

Personality-Aware Fairness Evaluation Framework

Integrates personality traits (e.g., introversion/extroversion) into fairness auditing alongside standard demographic attributes to detect psychological stereotyping
Evaluates robustness by perturbing prompts with typographical errors and multilingual translations (e.g., French) to measure fairness stability
Introduces PAFS (Personality-Aware Fairness Score) to quantify how consistently a model treats users across different personality profiles
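
The PAFS formula itself is not reproduced in this summary, so the following is only an illustrative stand-in for the idea of "consistency across personality profiles", not the paper's definition. It assumes each personality profile already has a similarity-to-neutral score in [0, 1] and reports one minus the population standard deviation of those scores; the function name `stability_score` and the trait labels are hypothetical.

```python
from statistics import pstdev


def stability_score(similarity_by_personality):
    """Illustrative stand-in, NOT the paper's PAFS formula.

    Takes {personality label: similarity to the neutral list} and returns
    1 - population standard deviation, so identical scores give 1.0.
    """
    scores = list(similarity_by_personality.values())
    if len(scores) < 2:
        return 1.0  # a single profile cannot disagree with itself
    return 1.0 - pstdev(scores)


# Hypothetical similarity scores for two personality-conditioned prompts.
consistent = stability_score({"introverted": 0.5, "extroverted": 0.5})
uneven = stability_score({"introverted": 0.4, "extroverted": 0.6})
```

Under this stand-in, a model that treats every personality profile alike scores 1.0, and the score falls as the per-profile similarities spread apart.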

Architecture

The FairEval framework pipeline for evaluating fairness in LLM-based recommender systems.

Evaluation Highlights

Discovered extreme fairness gaps in Gemini 1.5 Flash, with Sensitive-to-Neutral Similarity Range (SNSR) reaching 34.79% for religion-based music recommendations
ChatGPT 4o demonstrates superior personality consistency with PAFS@25 scores up to 0.9970, compared to lower stability in Gemini 1.5 Flash
Revealed significant robustness failures: Gemini's fairness scores drop below 0.60 under typographical noise, while ChatGPT maintains scores above 0.72

Breakthrough Assessment

7/10

Introduces a necessary dimension (personality) to RecLLM fairness and provides a rigorous stress-test methodology. While it doesn't propose a new model, the evaluation framework and metrics (PAFS) are valuable contributions.

⚙️ Technical Details

Problem Definition

Setting: Auditing black-box LLM recommenders by comparing outputs from neutral prompts vs. sensitive/personality-conditioned prompts

Inputs: Natural language prompts $p$ containing user requests, optionally augmented with sensitive attributes $a \in A$ (demographics) or personality traits

Outputs: Top-K ranked list of recommended items $R_p$

Pipeline Flow

Prompt Generation (Neutral, Demographic-Sensitive, Intersectional)
LLM Inference (Querying ChatGPT/Gemini)
Recommendation Parsing (Extracting lists)
Metric Calculation (Jaccard, SERP, PRAG, PAFS)
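
The "Recommendation Parsing" step listed above could look roughly like the sketch below. It assumes the model replies with a numbered or bulleted list, one item per line; the actual response format and the paper's parsing rules are not described here, so `parse_ranked_list` and its regular expression are hypothetical.

```python
import re

# Assumption: each recommended item sits on its own line, prefixed by
# "1." / "1)" or a bullet character. Other layouts would need other rules.
_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+(.*\S)\s*$")


def parse_ranked_list(response_text, k=25):
    """Return up to k item names in the order the model listed them."""
    items, seen = [], set()
    for line in response_text.splitlines():
        match = _ITEM.match(line)
        if not match:
            continue  # skip prose such as "Here are some suggestions:"
        name = match.group(1).strip().strip('"')
        key = name.casefold()
        if key not in seen:  # keep only the first (highest-ranked) mention
            seen.add(key)
            items.append(name)
    return items[:k]


reply = "Here are some picks:\n1. Blade Runner 2049\n2) Arrival\n- Dune\n3. arrival\n"
parsed = parse_ranked_list(reply)
```

Keeping the list order matters because the rank-aware metrics (SERP*, PRAG*) depend on item positions, not only on which items appear.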

System Modules

Prompt Generator

Constructs the neutral prompt $p_{neutral}$ and its variants $p_{sensitive}$ by injecting attributes such as race, gender, occupation, and personality

Model or implementation: Template-based injection
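
A minimal sketch of template-based injection, under stated assumptions: the template wording, the attribute values, and the helper names (`build_prompts`, `with_article`) are hypothetical and are not taken from the paper. It builds one neutral prompt plus one intersectional prompt for every combination of attribute values.

```python
from itertools import product

# Hypothetical templates; the paper's exact prompt wording is not given here.
NEUTRAL_TEMPLATE = "I am a fan of {anchor}. Please recommend {k} {item_type}."
SENSITIVE_TEMPLATE = (
    "I am {persona} and a fan of {anchor}. Please recommend {k} {item_type}."
)


def with_article(phrase):
    """Prefix 'a' or 'an' based on the first letter (a rough heuristic)."""
    return ("an " if phrase[0].lower() in "aeiou" else "a ") + phrase


def build_prompts(anchor, item_type, attributes, k=25):
    """Return {'neutral': prompt, (value, ...): prompt} for all combinations."""
    prompts = {
        "neutral": NEUTRAL_TEMPLATE.format(anchor=anchor, k=k, item_type=item_type)
    }
    for combo in product(*attributes.values()):
        persona = with_article(" ".join(combo))
        prompts[combo] = SENSITIVE_TEMPLATE.format(
            persona=persona, anchor=anchor, k=k, item_type=item_type
        )
    return prompts


example = build_prompts(
    "Christopher Nolan",
    "movies",
    {
        "personality": ["introverted", "extroverted"],
        "gender": ["female", "male"],
        "occupation": ["professor"],
    },
)
```

Holding the request text fixed and changing only the persona clause is what lets any difference in the returned lists be attributed to the injected attributes.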

Recommendation Engine

Generates item lists based on prompts

Model or implementation: ChatGPT 4o / Gemini 1.5 Flash

Fairness Evaluator

Computes similarity between neutral and sensitive outputs to quantify bias

Model or implementation: Statistical scripts

Novel Architectural Elements

Integration of personality traits into the sensitive attribute set for fairness auditing
Use of typographical and multilingual prompt perturbations as a fairness stress-test
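
The typographical stress-test can be approximated as follows. This is a sketch under assumptions: the paper's actual noise model is not specified here, so adjacent-letter swaps, the `rate` parameter, and the function name `add_typos` are illustrative choices only.

```python
import random


def add_typos(prompt, rate=0.1, seed=0):
    """Swap neighbouring letters with probability `rate` per letter pair.

    Only letter-letter pairs are swapped, so spaces and punctuation stay put.
    A fixed seed makes the perturbation repeatable across model runs.
    """
    rng = random.Random(seed)
    chars = list(prompt)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


original = "Please recommend some science fiction movies."
noisy = add_typos(original, rate=0.3, seed=1)
```

The same fairness metrics are then computed on the outputs for `noisy` and compared with those for `original`; a large drop signals that fairness is not stable under prompt noise.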

Modeling

Base Model: ChatGPT 4o and Gemini 1.5 Flash

Compute: Not reported in the paper

Comparison to Prior Work

vs. FaiR-LLM: Adds personality-aware fairness (PAFS) and robustness testing against typos/languages
vs. CFairLLM: Evaluates intersectional attributes (e.g., female doctor) rather than single demographics
vs. FairPrompt-LLM [not cited in paper]: FairEval focuses on *evaluation* metrics (PAFS) rather than *mitigation* via prompt tuning

Limitations

Relies on API-based models (ChatGPT/Gemini), which change over time and therefore limit reproducibility
Analysis limited to Movie and Music domains; results may not transfer to high-stakes domains like news or finance
Evaluation uses similarity to a 'neutral' prompt as the gold standard, assuming the neutral prompt itself generates unbiased/ideal results
No open-source code or dataset provided

Reproducibility

No replication artifacts (code, prompts, or data) are provided in the paper. The methodology is described via templates, but the exact dataset of 1000 prompts is not linked.

📊 Experiments & Results

Evaluation Setup

Prompt-based auditing of black-box LLMs on Music (MTV artists) and Movie (IMDB directors) recommendation tasks

Benchmarks:

MTV Music Dataset (Artist Recommendation) [New]
IMDB Movie Dataset (Director/Movie Recommendation) [New]

Metrics:

Jaccard@25 (Set Overlap)
SERP*@25 (Rank-weighted exposure)
PRAG*@25 (Pairwise ranking alignment)
PAFS@25 (Personality-Aware Fairness Score)
SNSR (Range of disparity)
SNSV (Variance of disparity)
Statistical methodology: Not explicitly reported in the paper
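
To make the set-overlap and disparity metrics concrete, here is a small sketch. It assumes one neutral list and one list per sensitive group; SNSR is computed as max minus min and SNSV as the population variance of the per-group similarities, following the definitions in this summary (some formulations report the square root instead). Group names and items are made up.

```python
from statistics import pvariance


def jaccard_at_k(neutral, sensitive, k=25):
    """Intersection over union of the two top-k item sets."""
    a, b = set(neutral[:k]), set(sensitive[:k])
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def snsr(similarity_by_group):
    """Range: gap between the best- and worst-treated group."""
    values = list(similarity_by_group.values())
    return max(values) - min(values)


def snsv(similarity_by_group):
    """Population variance of similarity across groups."""
    return pvariance(list(similarity_by_group.values()))


# Toy data: hypothetical item IDs, not results from the paper.
neutral_list = ["A", "B", "C", "D"]
group_lists = {
    "group_a": ["A", "B", "C", "E"],  # shares 3 of 5 distinct items
    "group_b": ["A", "F", "G", "H"],  # shares 1 of 7 distinct items
}
sims = {g: jaccard_at_k(neutral_list, items, k=4) for g, items in group_lists.items()}
disparity_range = snsr(sims)
disparity_variance = snsv(sims)
```

In this toy case `group_a` keeps most of the neutral recommendations while `group_b` loses almost all of them, which is the kind of gap a high SNSR flags.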

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparative analysis of fairness disparities (SNSR) across sensitive attributes for ChatGPT 4o and Gemini 1.5 Flash. Higher SNSR indicates greater unfairness.
MTV Music Dataset	SNSR (Jaccard@25)	0.1900	0.3479	+0.1579
IMDB Movie Dataset	SNSR (PRAG*@25)	0.0261	0.1398	+0.1137
Evaluation of Personality-Aware Fairness (PAFS) stability. Higher PAFS indicates the model is more robust to personality variations.
MTV Music Dataset	PAFS@25 (Max)	0.9910	0.9970	+0.0060
IMDB Movie Dataset	PAFS@25 (Max)	0.9842	0.9940	+0.0098
Robustness under prompt perturbations (Typographical Errors).
Perturbed Prompts	PRAG*@25	0.5892	0.7214	+0.1322

Experiment Figures

Robustness analysis of ChatGPT 4o vs Gemini 1.5 Flash under Typographical Errors and French translation.

Qualitative example of 'Preference Dissimilarity' in movie recommendations.

Main Takeaways

Gemini 1.5 Flash exhibits extreme sensitivity to religion in music recommendations (SNSR > 34%), significantly higher than ChatGPT 4o.
ChatGPT 4o is generally more robust to both personality variations (higher PAFS) and prompt noise (typos/multilingual) than Gemini 1.5 Flash.
Intersectionality matters: Prompts combining demographics (e.g., 'Middle Eastern female professor') trigger distinct 'Preference Dissimilarity' where models substitute stereotypes for explicit genre preferences.
Fairness is domain-dependent: ChatGPT showed higher racial bias in movies but lower religious bias in music compared to Gemini.

📚 Prerequisite Knowledge

Prerequisites

Basics of Recommender Systems (collaborative filtering vs. content-based)
Understanding of LLM prompting strategies
Familiarity with set similarity metrics (Jaccard)

Key Terms

RecLLMs: Large Language Model-based Recommender Systems that generate recommendations directly from natural language prompts

PAFS: Personality-Aware Fairness Score—a metric measuring how stable recommendations remain when user personality traits are varied in the prompt

SNSR: Sensitive-to-Neutral Similarity Range—the difference between the maximum and minimum similarity scores across different sensitive groups (measures disparity)

SNSV: Sensitive-to-Neutral Similarity Variance—the statistical variance of similarity scores across groups (measures inconsistency)

Jaccard Similarity: The size of the intersection of two sets divided by the size of their union; here it measures how many items two top-K recommendation lists share, ignoring order

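SERP: Search Result Page Misinformation Score; the adapted SERP* variant used here is a rank-weighted similarity that gives more credit to shared items placed near the top of the list

PRAG: Pairwise Ranking Accuracy Gap; the adapted PRAG* variant used here measures how well the pairwise ordering of items agrees between neutral and sensitive outputs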

Preference Dissimilarity: When the model recommends items that align with a stereotype rather than the user's explicit genre/content preference
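
As an illustration of how a rank-aware metric such as SERP differs from plain Jaccard overlap, the sketch below weights each shared item by its position in the sensitive list. The specific weighting, (K − rank + 1) normalised by K(K+1)/2, is an assumption based on a commonly used form and may not match the paper's exact definition; `rank_weighted_overlap` is a hypothetical name.

```python
def rank_weighted_overlap(neutral, sensitive, k=25):
    """Rank-weighted share of the sensitive top-k that also appears in the
    neutral top-k. Rank 1 earns weight k, rank k earns weight 1; the total
    is divided by k(k+1)/2 so a full match scores 1.0.
    """
    neutral_top = set(neutral[:k])
    score = sum(
        k - rank + 1
        for rank, item in enumerate(sensitive[:k], start=1)
        if item in neutral_top
    )
    return score / (k * (k + 1) / 2)


# Toy lists with hypothetical item IDs.
neutral_list = ["A", "B", "C"]
match_at_top = rank_weighted_overlap(neutral_list, ["A", "X", "Y"], k=3)
match_at_bottom = rank_weighted_overlap(neutral_list, ["X", "Y", "A"], k=3)
```

Both toy lists share exactly one item with the neutral list, so their Jaccard scores are equal, yet the rank-weighted score is higher when the shared item sits at the top.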