Evaluation Setup
Prompt-based auditing of black-box LLMs on Music (MTV artists) and Movie (IMDB directors) recommendation tasks
Benchmarks:
- MTV Music Dataset (Artist Recommendation) [New]
- IMDB Movie Dataset (Director/Movie Recommendation) [New]
Metrics:
- Jaccard@25 (Set Overlap)
- SERP*@25 (Rank-weighted exposure)
- PRAG*@25 (Pairwise ranking alignment)
- PAFS@25 (Personality-Aware Fairness Score)
- SNSR (Range of disparity)
- SNSV (Variance of disparity)
- Statistical methodology: Not explicitly reported in the paper
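
As a rough illustration of how the set-overlap and disparity metrics above relate, here is a minimal Python sketch. It rests on assumptions not stated in this summary: each sensitive-attribute value (for example, each religion) is assumed to yield one top-25 list that is compared against the list from a neutral prompt, SNSR is taken as max minus min of those similarities, and SNSV is taken as their population variance, following the "variance" label above (some formulations use the standard deviation instead). The function names and toy data are hypothetical, not taken from the paper.

```python
from statistics import pvariance


def jaccard_at_k(neutral, sensitive, k=25):
    """Set overlap between the top-k items of two recommendation lists."""
    a, b = set(neutral[:k]), set(sensitive[:k])
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


def snsr(similarities):
    """Range of disparity: max minus min similarity across attribute values."""
    values = list(similarities.values())
    return max(values) - min(values)


def snsv(similarities):
    """Variance of disparity (population variance; an assumption, see lead-in)."""
    return pvariance(list(similarities.values()))


# Toy data (hypothetical): one neutral list and one list per attribute value.
neutral_list = ["a", "b", "c", "d"]
group_lists = {
    "group_1": ["a", "b", "c", "x"],
    "group_2": ["a", "y", "z", "w"],
}

sims = {
    group: jaccard_at_k(neutral_list, items, k=4)
    for group, items in group_lists.items()
}
print(sims, snsr(sims), snsv(sims))
```

Under these assumptions, a larger SNSR or SNSV means the recommendations shift more as the sensitive attribute changes, which matches the "higher indicates greater unfairness" reading used in the results below.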
Key Results
Comparative analysis of fairness disparities (SNSR) across sensitive attributes for ChatGPT 4o and Gemini 1.5 Flash. Higher SNSR indicates greater unfairness.
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MTV Music Dataset | SNSR (Jaccard@25) | 0.1900 | 0.3479 | +0.1579 |
| IMDB Movie Dataset | SNSR (PRAG*@25) | 0.0261 | 0.1398 | +0.1137 |

Evaluation of Personality-Aware Fairness (PAFS) stability. Higher PAFS indicates the model is more robust to personality variations.
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MTV Music Dataset | PAFS@25 (Max) | 0.9910 | 0.9970 | +0.0060 |
| IMDB Movie Dataset | PAFS@25 (Max) | 0.9842 | 0.9940 | +0.0098 |

Robustness under prompt perturbations (Typographical Errors).
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Perturbed Prompts | PRAG*@25 | 0.5892 | 0.7214 | +0.1322 |
Main Takeaways
- Gemini 1.5 Flash exhibits extreme sensitivity to religion in music recommendations (SNSR > 0.34), substantially higher than ChatGPT 4o.
- ChatGPT 4o is generally more robust to both personality variations (higher PAFS) and prompt noise (typos/multilingual) than Gemini 1.5 Flash.
- Intersectionality matters: Prompts combining demographics (e.g., 'Middle Eastern female professor') trigger distinct 'Preference Dissimilarity' where models substitute stereotypes for explicit genre preferences.
- Fairness is domain-dependent: ChatGPT 4o showed higher racial bias than Gemini 1.5 Flash in movie recommendations but lower religious bias in music recommendations.