
A Normative Framework for Benchmarking Consumer Fairness in Large Language Model Recommender System

Yashar Deldjoo, Fatemeh Nazary
Polytechnic University of Bari
arXiv (2024)
Recommendation Benchmark P13N

📝 Paper Summary

LLM-based Recommender Systems (RecLLMs) · Fairness in Recommender Systems
The paper introduces a normative framework for evaluating consumer fairness in RecLLMs by comparing recommendation benefits against clear reference points (neutral rankers or counterfactual scenarios) rather than just measuring raw output differences.
Core Problem
Existing fairness evaluations for RecLLMs often naively equate any difference in recommendations between groups with unfairness, failing to distinguish between valid personalization and harmful bias.
Why it matters:
  • LLMs inherit vast, unregulated biases from pre-training data, which can amplify stereotypes in recommender systems
  • Standard collaborative filtering fairness norms do not account for RecLLMs' ability to process natural language user profiles containing sensitive attributes
  • Previous frameworks like FaiRLLM flag all deviations as unfair, even when the sensitive ranker (the one that sees the user's sensitive attributes) gives a specific group better utility through personalization
Concrete Example: If a RecLLM recommends 'Hey Young Girl' by Lloyd in a neutral setting but switches to Jamiroquai when gender is known, a naive evaluation flags this as unfair. However, if the user actually prefers Jamiroquai, the switch is valid personalization, not bias. The proposed framework distinguishes these cases by checking whether the change improves or harms utility.
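The check in the example above can be sketched for a single user as follows. This is a minimal illustration, not the paper's implementation: the hit-rate utility, the function names `hit_rate` and `classify_change`, and the label strings are all assumptions introduced here.

```python
def hit_rate(recommended, liked):
    """Fraction of recommended items the user actually likes (assumed utility)."""
    if not recommended:
        return 0.0
    return sum(item in liked for item in recommended) / len(recommended)


def classify_change(neutral_recs, sensitive_recs, liked, tol=1e-9):
    """Label the effect of revealing a sensitive attribute for one user.

    Compares the utility of the sensitive ranker's list against the
    neutral ranker's list instead of only checking that the lists differ.
    """
    delta = hit_rate(sensitive_recs, liked) - hit_rate(neutral_recs, liked)
    if delta > tol:
        return "improves utility"
    if delta < -tol:
        return "harms utility"
    return "no utility change"


# Hypothetical user from the example: prefers Jamiroquai.
liked = {"Jamiroquai"}
label = classify_change(["Lloyd"], ["Jamiroquai"], liked)
print(label)
```

A disparity-only audit would flag the Lloyd-to-Jamiroquai switch regardless of `liked`; here the verdict depends on whether the switch helped this user.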
Key Novelty
Normative Fairness Framework for RecLLMs
  • Defines fairness based on 'Benefit Deviation' (comparing utility against a reference) rather than just output disparity
  • Introduces three distinct fairness metrics: Neutral vs. Sensitive Deviation (impact of adding sensitive traits), Neutral vs. Counterfactual Deviation (impact of hypothetically swapping traits), and Intrinsic Fairness (alignment with target distributions)
  • Uses statistical significance testing (t-tests) to determine whether observed benefit deviations across groups reflect random noise or systematic bias
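As a rough sketch of how a benefit deviation and its significance test might fit together, the snippet below computes per-user deviations (sensitive minus neutral utility) for two groups and compares them with Welch's t statistic. The summary does not state which t-test variant or utility measure the paper uses, so the Welch variant, the function names, and all utility values here are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, variance


def benefit_deviation(reference_utils, variant_utils):
    """Per-user utility change: variant ranker minus reference ranker."""
    return [v - r for r, v in zip(reference_utils, variant_utils)]


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-user utilities (not results from the paper).
group_a_neutral = [0.20, 0.30, 0.25, 0.40]
group_a_sensitive = [0.30, 0.35, 0.30, 0.50]
group_b_neutral = [0.30, 0.20, 0.35, 0.25]
group_b_sensitive = [0.25, 0.15, 0.30, 0.25]

dev_a = benefit_deviation(group_a_neutral, group_a_sensitive)
dev_b = benefit_deviation(group_b_neutral, group_b_sensitive)

# A large |t| suggests the two groups gain or lose benefit differently.
t_stat, dof = welch_t(dev_a, dev_b)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

Turning the statistic into a p-value needs the t-distribution CDF, which the standard library does not provide; with SciPy installed, `scipy.stats.ttest_ind(dev_a, dev_b, equal_var=False)` performs the same Welch test and returns the p-value directly. The same `benefit_deviation` helper would apply to the neutral vs. counterfactual comparison by passing counterfactual utilities as the variant.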
Architecture
Figure 1: An example of a RecLLM input prompt and the concept of Neutral vs. Sensitive rankers.
Evaluation Highlights
  • Experiments on MovieLens show fairness deviations in age-based recommendations, particularly when few-shot examples (ICL-2) are introduced
  • Statistical significance tests confirm that observed deviations in benefit between demographic groups are non-random
  • Demonstrates that adding context (few-shot) can amplify fairness issues compared to zero-shot scenarios
Breakthrough Assessment
7/10
Solid conceptual contribution that formalizes fairness auditing for RecLLMs, moving beyond naive disparity measures. It provides a principled way to distinguish personalization from bias, though the experimental validation is preliminary.