Item-side Fairness of Large Language Model-based Recommendation System

📝 Paper Summary

LLM-based Recommendation Systems (LRS), Trustworthy Recommendation

The paper reveals that LLM-based recommendation systems exhibit severe item-side unfairness due to popularity and semantic biases, and proposes the IFairLRS framework to mitigate this via training reweighting and inference reranking.

Core Problem

LLM-based Recommendation Systems (LRS) suffer from significant item-side unfairness, over-recommending popular items and specific genres due to biases in interaction history and the LLM's pre-trained semantic priors.

Why it matters:

Fair exposure is critical for the economic rights of item producers (e.g., job candidates, micro-businesses) and the visibility of content related to vulnerable populations
Existing fairness methods for conventional discriminative models do not directly apply to generative LRS, which rely on instruction tuning and text generation
Prior work like LLMRank only qualitatively observed popularity bias; a comprehensive quantitative investigation and mitigation framework for LRS was lacking

Concrete Example: In a movie recommendation scenario, an LRS like BIGRec might recommend 'The Mighty Ducks' (a popular comedy) even if the specific genre 'Comedy' was removed from the fine-tuning data, demonstrating that the model relies on unfair semantic priors from pre-training rather than just user history.

Key Novelty

IFairLRS Framework (In-learning Reweighting + Post-learning Reranking)

Conducts the first comprehensive quantitative audit of item-side fairness in LRS, distinguishing between biases arising from historical interactions (popularity) and biases from LLM semantic priors (genres)
Proposes a two-stage mitigation framework: 'In-learning' reweights training samples to balance the target item distribution, and 'Post-learning' reranks outputs to penalize unfair exposure (implementation details are not covered in this summary)
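
The in-learning idea of reweighting training samples can be illustrated with a generic inverse-frequency scheme. This is a minimal sketch under stated assumptions, not the paper's formula: the function name `balanced_sample_weights` and the group labels are hypothetical, and the weighting rule is a common balancing heuristic chosen here only for illustration.

```python
from collections import Counter


def balanced_sample_weights(target_groups):
    """Give each training sample a weight inversely proportional to the
    frequency of its target item's group.

    Assumption: this is a generic inverse-frequency heuristic, not the
    reweighting rule used by IFairLRS.
    """
    if not target_groups:
        raise ValueError("target_groups must not be empty")
    counts = Counter(target_groups)
    n_samples = len(target_groups)
    n_groups = len(counts)
    # Each group ends up with the same total weight (n_samples / n_groups),
    # and the weights sum to n_samples, so the loss scale is preserved.
    return [n_samples / (n_groups * counts[g]) for g in target_groups]


# Hypothetical group labels of the target item in four training samples.
example_groups = ["popular", "popular", "popular", "niche"]
example_weights = balanced_sample_weights(example_groups)
```

In a fine-tuning loop, such weights would typically multiply the per-sample loss, so that samples targeting over-represented groups contribute less to each update.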

Evaluation Highlights

Comparison with SASRec (Self-Attentive Sequential Recommendation) shows that LRS is more strongly affected by popularity bias, consistently recommending more popular items than SASRec does
Probing experiments show LRS recommends item genres never seen during fine-tuning, indicating that unfairness stems partly from pre-trained semantic knowledge, not just interaction data
Analysis of 'grounding' (mapping generated text to items) shows it mitigates some inherent unfairness but transfers bias from low-popularity to high-popularity groups

Breakthrough Assessment

7/10

Significant for identifying that LRS fairness issues stem from pre-training priors, not just data imbalance. The proposed solution framework is standard (reweighting/reranking) but applied to a novel domain (LRS).

⚙️ Technical Details

Problem Definition

Setting: Sequential Recommendation with Item-side Fairness constraints

Inputs: User historical interaction sequence (represented as natural language text)

Outputs: Ranked list of Top-K recommended items

Pipeline Flow

Prompt Construction (History to Text)
LLM Generation (Text-to-Text)
Grounding (Text-to-Item)

System Modules

Prompt Construction

Convert user interaction history into a natural language instruction prompt

Model or implementation: Deterministic formatting
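
A deterministic formatting step of this kind might look like the sketch below. The template wording, the function name `build_prompt`, and the `domain` parameter are hypothetical placeholders; the actual prompt used in the paper is not given in this summary.

```python
def build_prompt(history_titles, domain="movie"):
    """Render a chronological interaction history as an instruction prompt.

    The template wording is a hypothetical placeholder, not the paper's prompt.
    """
    if not history_titles:
        raise ValueError("history_titles must contain at least one item")
    # Quote each title so multi-word titles stay unambiguous in the prompt.
    quoted = ", ".join(f'"{title}"' for title in history_titles)
    return (
        "### Instruction:\n"
        f"Given the {domain}s a user has interacted with, in order, "
        f"recommend the next {domain} for this user.\n\n"
        "### Input:\n"
        f"{quoted}\n\n"
        "### Response:\n"
    )


prompt = build_prompt(["Toy Story", "Heat", "Fargo"])
```

Because the function has no randomness, the same history always yields the same prompt, which keeps training and inference inputs consistent.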

LLM Generation

Generate the textual description (e.g., title) of the next item of interest

Model or implementation: LLaMA (Instruction-tuned)

Grounding

Map the generated description to valid items in the candidate set

Model or implementation: L2 Embedding Distance
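
Grounding by L2 distance can be sketched as a nearest-neighbor lookup over item embeddings. The sketch assumes the generated text and every candidate item title have already been embedded by the same encoder; the function name `ground_top_k` and the toy two-dimensional vectors are illustrative only.

```python
import math


def ground_top_k(generated_emb, item_embs, k=5):
    """Return the indices of the k candidate items whose embeddings are
    closest to the generated text embedding in L2 (Euclidean) distance."""
    dists = [math.dist(generated_emb, emb) for emb in item_embs]
    # sorted() is stable, so ties keep their original catalog order.
    order = sorted(range(len(dists)), key=lambda i: dists[i])
    return order[:k]


# Toy 2-D embeddings for four hypothetical catalog items.
toy_item_embs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_top = ground_top_k([0.9, 0.1], toy_item_embs, k=2)
```

Here `toy_top` lists item 1 first and item 0 second, since those two embeddings lie nearest to the query vector.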

Modeling

Base Model: LLaMA

Training Method: Instruction Tuning (SFT)

Training Data:

MovieLens1M and Steam datasets
Split by timestamp (8:1:1 for train/val/test)
Steam dataset filtered: genres with <10k interactions removed; max 10 interactions per user retained
Sampled 65,536 instances for training
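
The timestamp-based 8:1:1 split can be sketched as follows. This assumes a single global chronological split over all interactions, which the summary does not confirm; the record layout (dicts with a `timestamp` key) and the function name `split_by_timestamp` are hypothetical.

```python
def split_by_timestamp(interactions, ratios=(0.8, 0.1, 0.1)):
    """Split interaction records chronologically into train/val/test.

    Assumption: one global split over all records, oldest first.
    """
    ordered = sorted(interactions, key=lambda r: r["timestamp"])
    n = len(ordered)
    train_end = int(n * ratios[0])
    val_end = train_end + int(n * ratios[1])
    # The test split takes whatever remains, so no record is dropped.
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Ten toy records with shuffled timestamps.
toy_records = [{"user": u, "timestamp": t}
               for u, t in enumerate([5, 3, 9, 1, 7, 2, 8, 0, 6, 4])]
train, val, test = split_by_timestamp(toy_records)
```

Splitting by time rather than at random keeps later interactions out of the training set, so evaluation reflects predicting future behavior from past behavior.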

Key Hyperparameters:

training_samples: 65536

Compute: Not reported in the paper

Comparison to Prior Work

vs. SASRec: LRS (BIGRec) relies on generative semantics rather than discriminative IDs, leading to higher popularity bias
vs. BIGRec: The paper proposes IFairLRS to correct the unfairness observed in BIGRec (framework details are not covered in this summary)
vs. LLMRank: LLMRank only qualitatively notes popularity bias; this paper quantifies it and distinguishes semantic vs. interaction bias

Limitations

Grounding process introduces its own biases, transferring unfairness from low to high popularity groups
Beam search during inference significantly amplifies inequity compared to greedy decoding
LRS retains fairness issues from pre-training even when fine-tuning data is balanced or filtered

Reproducibility

Code: https://github.com/JiangM-C/IFairLRS.git

Code is publicly available on GitHub. Datasets (MovieLens, Steam) are public. The training sample size (65,536 instances) is specified. Specific SFT hyperparameters (learning rate, batch size) are not covered in this summary.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation on MovieLens1M and Steam datasets

Benchmarks:

MovieLens1M (Movie Recommendation)
Steam (Game Recommendation)

Metrics:

GP (Group Proportion)
GU (Group Unfairness)
MGU (Mean Group Unfairness)
DGU (Disparity Group Unfairness)

Statistical methodology: Not explicitly reported in the paper
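
The four metrics are only named here, so the sketch below rests on assumed definitions: GP is the share of a group among a list of items, GU is a group's GP in the recommendations minus its GP in the interaction history, MGU is the mean absolute GU across groups, and DGU is the gap between the largest and smallest GU. These definitions are consistent with the notion of fair exposure relative to presence in user history, but they should be checked against the paper before use. All function and variable names are hypothetical.

```python
from collections import Counter


def group_proportion(items, item_to_group, groups):
    """GP (assumed definition): share of each group among the given items."""
    if not items:
        raise ValueError("items must not be empty")
    counts = Counter(item_to_group[i] for i in items)
    return {g: counts.get(g, 0) / len(items) for g in groups}


def fairness_metrics(recommended, history, item_to_group):
    """Return (GU per group, MGU, DGU) under the assumed definitions."""
    groups = sorted(set(item_to_group.values()))
    gp_rec = group_proportion(recommended, item_to_group, groups)
    gp_hist = group_proportion(history, item_to_group, groups)
    # GU > 0 means the group is over-exposed relative to the history.
    gu = {g: gp_rec[g] - gp_hist[g] for g in groups}
    mgu = sum(abs(v) for v in gu.values()) / len(groups)
    dgu = max(gu.values()) - min(gu.values())
    return gu, mgu, dgu


# Toy catalog: two popular items and two niche items.
toy_groups = {"a": "popular", "b": "popular", "c": "niche", "d": "niche"}
toy_gu, toy_mgu, toy_dgu = fairness_metrics(
    recommended=["a", "a", "b", "c"],
    history=["a", "b", "c", "d"],
    item_to_group=toy_groups,
)
```

In this toy case the history is evenly split between the two groups while three of the four recommendations are popular items, so the popular group has a GU of 0.25, the niche group has a GU of -0.25, MGU is 0.25, and DGU is 0.5.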

Experiment Figures

Comparison of item popularity distribution in recommendations vs. history for MovieLens1M and Steam

Impact of removing specific genres from training data on recommendation probability

Main Takeaways

LRS (BIGRec) is significantly more unfair than traditional models (SASRec) regarding popularity, consistently over-recommending popular items.
LRS exhibits semantic bias: it recommends items from genres (e.g., Comedy) even if those genres were removed from the fine-tuning data, indicating reliance on pre-trained knowledge.
The 'Grounding' phase (mapping text to items) helps mitigate some unfairness for unpopular items but can inadvertently boost high-popularity groups.
Increasing K (in Top-K) alleviates popularity unfairness in LRS as the grounding retrieves a wider range of items.

📚 Prerequisite Knowledge

Prerequisites

Basics of Sequential Recommendation
Generative Recommendation (LLM-based)
Fairness metrics (Group Fairness)

Key Terms

LRS: Large Language Model-based Recommendation System—systems using LLMs to generate recommendations from natural language descriptions of user history

Item-side Fairness: Ensuring different item groups (e.g., unpopular items, specific genres) receive fair exposure opportunities relative to their presence in user history

Grounding: The process of mapping the text generated by an LLM (e.g., a movie title) to a specific item ID in the database, often using embedding distance

SFT: Supervised Fine-Tuning—training the LLM on specific recommendation tasks using labeled interaction data

SASRec: Self-Attentive Sequential Recommendation—a traditional (non-LLM) deep learning baseline for sequential recommendation

BIGRec: A representative LRS that fine-tunes LLaMA on interaction data and uses embedding-based grounding for retrieval