Exploring the Escalation of Source Bias in User, Data, and Recommender System Feedback Loop

📝 Paper Summary

Recommender Systems, AI-Generated Content (AIGC) Bias, Feedback Loops

Sequential recommender systems inherently rank LLM-generated text higher than human text, creating a feedback loop that progressively amplifies this bias and eventually degrades recommendation performance.

Core Problem

As AI-Generated Content (AIGC) floods the internet, recommender systems exhibit 'Source Bias,' preferentially ranking AIGC higher than human content. This bias creates a self-reinforcing feedback loop.

Why it matters:

Unfairness: Content creators may be forced to use LLMs to rewrite descriptions just to gain visibility, disadvantaging original human writing
Model Collapse: Training on excessive AIGC (which models prefer) eventually leads to a decline in recommendation accuracy and ecosystem diversity
Traffic Distribution: The bias causes unfair traffic allocation, pushing AIGC to the top regardless of actual user preference or utility

Concrete Example: A human seller writes a product description. An LLM rewrites it to be semantically identical but stylistically different. The recommender system ranks the LLM version higher solely due to its source style, pushing the human version down the list.

Key Novelty

Simulation of AIGC Escalation in Feedback Loops

Identifies 'Source Bias' where models favor LLM-generated text patterns (e.g., from Llama or Mistral) over human text
Simulates a three-phase evolution (HGC Dominate → Coexist → AIGC Dominate) to show how user interactions and retraining amplify this initial bias over time
Proposes a debiasing method using L1-loss optimization to align the embedding spaces of human and AI-generated content
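The summary states only that the debiasing method uses L1-loss optimization to align the embedding spaces of HGC and AIGC. The following is a minimal sketch of what such an alignment term could look like, assuming each item has a paired HGC embedding and AIGC embedding of equal dimension. The function name `l1_alignment_loss` is hypothetical, and how the paper weights this term against the recommendation loss is not given here.

```python
from typing import Sequence


def l1_alignment_loss(hgc_emb: Sequence[Sequence[float]],
                      aigc_emb: Sequence[Sequence[float]]) -> float:
    """Mean absolute difference between paired HGC and AIGC item embeddings.

    Assumption: row j of both inputs describes the same item, once from the
    human text and once from its LLM rewrite.
    """
    if len(hgc_emb) != len(aigc_emb):
        raise ValueError("expected one AIGC embedding per HGC embedding")
    total, count = 0.0, 0
    for h_vec, a_vec in zip(hgc_emb, aigc_emb):
        if len(h_vec) != len(a_vec):
            raise ValueError("paired embeddings must share a dimension")
        for h, a in zip(h_vec, a_vec):
            total += abs(h - a)
            count += 1
    return total / count if count else 0.0
```

A value of zero means the two copies of every item map to the same point, so a downstream ranker could not separate them by source style alone.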

Architecture

The Feedback Loop involving Users, Data, and the Recommender System

Evaluation Highlights

Demonstrates that popular sequential models (BERT4Rec, SASRec) rank AIGC copies higher than original human text across Amazon datasets (Health, Beauty, Sports)
Shows that ChatGPT-generated content induces less bias than content from other LLMs (Llama, Mistral), likely due to alignment training
Experiments reveal a decline in recommendation performance (NDCG/MAP) after AIGC dominates the feedback loop (20 iterations)

Breakthrough Assessment

7/10

Important identification of a subtle but systemic bias (Source Bias) in the LLM era. The simulation of feedback loops provides a necessary long-term view of AIGC's impact on RecSys, though the solution (L1 loss) is relatively standard.

⚙️ Technical Details

Problem Definition

Setting: Sequential Recommendation with mixed content sources (Human vs. AI)

Inputs: User interaction sequence S = {i_1, ..., i_t} consisting of mixed Human-Generated Content (HGC) and AIGC items

Outputs: Predicted next item i_{t+1} from a candidate set I
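The setting can be written compactly as follows. The scoring function f_θ is an assumed name for whatever the sequential recommender computes; the summary does not introduce a symbol for it.

```latex
\hat{i}_{t+1} = \arg\max_{i \in I} f_\theta(i \mid S), \qquad S = \{i_1, \dots, i_t\}
```

In this notation, source bias would show up as f_θ assigning a systematically higher score to the AIGC copy of an item than to its semantically identical HGC copy.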

Pipeline Flow

Content Generation: Rewrite HGC items using LLMs (Llama, Mistral, etc.) to create AIGC copies
Recommender Training: Train sequential models (e.g., SASRec) on mixed history
User Simulation (Feedback Loop): Simulate user clicks on Top-K results using a Position-Based Model (PBM)
Model Update: Retrain model on new interaction data enriched with AIGC clicks
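The four steps above can be sketched as a toy simulation. This is not the paper's implementation: the recommender is replaced by a stand-in that scores each copy by its click count, a multiplicative `source_bias` boost for AIGC copies, and per-user noise, and the click model assumes an examination probability of 1/rank. All names and default values are hypothetical.

```python
import random


def run_feedback_loop(n_items=20, top_k=5, n_rounds=10, n_users=100,
                      source_bias=0.2, relevance=0.8, seed=0):
    """Toy loop: rank -> simulate clicks -> fold clicks back into the 'model'.

    Every item exists as an HGC copy and an AIGC copy with equal relevance.
    Returns the share of new clicks that went to AIGC copies in each round.
    """
    rng = random.Random(seed)
    entries = [(item, src) for item in range(n_items) for src in ("HGC", "AIGC")]
    clicks = {entry: 1.0 for entry in entries}  # interaction history
    aigc_share_per_round = []
    for _ in range(n_rounds):
        new_clicks = {entry: 0 for entry in entries}
        for _ in range(n_users):
            # Stand-in recommender: popularity x source preference x noise.
            scores = {}
            for entry in entries:
                boost = 1.0 + source_bias if entry[1] == "AIGC" else 1.0
                scores[entry] = clicks[entry] * boost * rng.uniform(0.5, 1.5)
            ranked = sorted(entries, key=scores.get, reverse=True)[:top_k]
            # Position-based clicks: examination 1/rank (assumed) x relevance.
            for rank, entry in enumerate(ranked, start=1):
                if rng.random() < relevance / rank:
                    new_clicks[entry] += 1
        total_new = sum(new_clicks.values())
        aigc_new = sum(n for (_, src), n in new_clicks.items() if src == "AIGC")
        aigc_share_per_round.append(aigc_new / total_new if total_new else 0.0)
        # "Retrain": the next round scores items using the enlarged history.
        for entry, n in new_clicks.items():
            clicks[entry] += n
    return aigc_share_per_round
```

Calling `run_feedback_loop()` returns one AIGC click share per round, which can be plotted to see whether the initial boost compounds under these toy assumptions.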

System Modules

Content Rewriter

Generate AIGC copies of product descriptions

Model or implementation: Various LLMs (Llama-2-7b-chat, Mistral-7B-Instruct-v0.2, Gemini-1.5-Pro, GPT-3.5-turbo)

Sequential Recommender

Predict next item based on history

Model or implementation: SASRec / BERT4Rec / GRU4Rec / LRURec

User Simulator

Simulate user clicks on the ranked list

Model or implementation: Position-Based Model (PBM)
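A Position-Based Model treats a click as the product of two factors: the chance that the user examines a rank position and the relevance of the item shown there. The sketch below assumes an examination probability of 1/rank^eta, which is one common choice and not confirmed by this summary; `examination_prob`, `simulate_clicks`, and `eta` are hypothetical names.

```python
import random
from typing import List, Sequence


def examination_prob(rank: int, eta: float = 1.0) -> float:
    """Probability that a user looks at a 1-indexed rank position (assumed form)."""
    return 1.0 / (rank ** eta)


def simulate_clicks(relevance: Sequence[float], rng: random.Random,
                    eta: float = 1.0) -> List[bool]:
    """Sample one click decision per position of a ranked list.

    relevance[j] is the probability that the item at rank j+1 is clicked
    once it has been examined.
    """
    clicks = []
    for rank, rel in enumerate(relevance, start=1):
        p_click = examination_prob(rank, eta) * rel
        clicks.append(rng.random() < p_click)
    return clicks
```

Because examination decays with rank, two equally relevant items receive different click volumes depending on where the recommender places them, which is the channel through which a ranking bias can turn into biased training data.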

Novel Architectural Elements

Integration of an AIGC-injection feedback loop simulation to measure long-term bias escalation in sequential recommenders

Modeling

Base Model: Sequential Recommenders: BERT4Rec, SASRec, GRU4Rec, LRURec

Training Method: Supervised learning on interaction sequences (Standard RecSys training)

Trainable Parameters: Recommendation model weights (PLM encoders are frozen)

Training Data:

Amazon Product Datasets (Health, Beauty, Sports)
Rewritten descriptions using LLMs
Items whose descriptions have fewer than 20 words are excluded
Users and items with fewer than 5 interactions are excluded
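The two exclusion rules can be sketched as a preprocessing step. The summary does not say whether the interaction filter is applied once or repeated until every remaining user and item meets the threshold; this sketch assumes the repeated version. The function name and the data layout (a description dictionary plus a list of user-item pairs) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def filter_dataset(descriptions: Dict[str, str],
                   interactions: List[Tuple[str, str]],
                   min_words: int = 20,
                   min_interactions: int = 5) -> List[Tuple[str, str]]:
    """Drop short-description items, then sparse users and items."""
    # Rule 1: keep only items whose description has at least min_words words.
    keep_items = {item for item, text in descriptions.items()
                  if len(text.split()) >= min_words}
    kept = [(u, i) for u, i in interactions if i in keep_items]
    # Rule 2: repeat until no user or item falls below the threshold (assumed).
    while True:
        user_counts = Counter(u for u, _ in kept)
        item_counts = Counter(i for _, i in kept)
        filtered = [(u, i) for u, i in kept
                    if user_counts[u] >= min_interactions
                    and item_counts[i] >= min_interactions]
        if len(filtered) == len(kept):
            return filtered
        kept = filtered
```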

Key Hyperparameters:

batch_size: 128
learning_rate: 1e-3
epochs: 5 (per loop iteration)
item_embedding_dim: 768
max_sequence_length: 512 tokens (text), 10 items (history)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Static Debiasing: This paper addresses dynamic feedback loops where bias escalates over time, showing prior methods fail in long-term scenarios
vs. General Model Collapse: Specifically focuses on the mechanism of 'Source Bias' in ranking systems as the driver of collapse, rather than just data quality degradation

Limitations

Experiments rely on simulated user clicks (PBM) rather than live user traffic
Focus is restricted to text-based sequential recommendation; does not explore image or multimodal AIGC
Debiasing method details are briefly mentioned but full ablation is not provided in the text snippet

Reproducibility

Code: https://github.com/Yuqi-Zhou/Rec_SourceBias

The code and dataset are publicly available at https://github.com/Yuqi-Zhou/Rec_SourceBias. The experiments use both open-source LLMs (Llama, Mistral) and API-based ones (Gemini, ChatGPT).

📊 Experiments & Results

Evaluation Setup

Next-item prediction on Amazon datasets with simulated feedback loops

Benchmarks:

Amazon Beauty (Sequential Recommendation)
Amazon Health (Sequential Recommendation)
Amazon Sports (Sequential Recommendation)

Metrics:

NDCG@K (K=3, 5)
MAP@K (K=3, 5)
Relative Delta (Δ) (Preference metric)
Statistical methodology: Experiments run with 5 different seeds; results averaged.
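For next-item prediction with a single ground-truth item per user, NDCG@K and average precision at K reduce to simple functions of the target's rank, and MAP@K is the mean of the per-user average precision. The sketch below covers that single-target case only. The `relative_delta` function is an assumption: the summary does not give the formula for Relative Delta, so a symmetric percentage difference is used here for illustration, with negative values meaning the AIGC copies score higher.

```python
import math
from typing import Sequence


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """NDCG@K when exactly one item (the true next item) is relevant."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def ap_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """Average precision at K for a single relevant item: 1 / rank, or 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / rank
    return 0.0


def relative_delta(metric_hgc: float, metric_aigc: float) -> float:
    """Assumed form: percentage gap relative to the mean of the two metrics."""
    mean = (metric_hgc + metric_aigc) / 2.0
    return 0.0 if mean == 0 else 100.0 * (metric_hgc - metric_aigc) / mean
```

Averaging `ndcg_at_k` and `ap_at_k` over all test users, and then over the 5 seeds, would give the reported style of number under these assumptions.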

Key Results

Numeric results are not available here: the paper's Tables 2, 3, and 4 are referenced, but their numeric content is not included in the source text, so no Benchmark / Metric / Baseline / This Paper / Δ rows can be reported. Qualitative findings are listed below.

Experiment Figures

The three phases of AIGC impact: HGC Dominate, HGC-AIGC Coexist, and AIGC Dominate

Human evaluation results comparing purchase inclination for HGC vs. AIGC

Main Takeaways

RecSys models exhibit 'Source Bias': They rank LLM-generated descriptions higher than semantically identical human descriptions across all tested domains (Health, Beauty, Sports).
Bias Escalation: In feedback loop simulations, as users interact with top-ranked AIGC, the model retrains on this data, progressively increasing the ranking of AIGC until it dominates the candidate list.
Performance Degradation: While AIGC initially boosts ranking (short-term), the dominance of AIGC in the training loop eventually leads to a decline in overall recommendation accuracy (long-term).
LLM Variance: Content generated by ChatGPT (gpt-3.5-turbo) triggers less source bias in recommenders than content from Llama-2 or Mistral, suggesting that RLHF or other alignment training in ChatGPT may make its output closer to human writing patterns.

📚 Prerequisite Knowledge

Prerequisites

Sequential Recommender Systems (SASRec, BERT4Rec)
Feedback Loops in Machine Learning
Large Language Models (LLMs) for text generation

Key Terms

AIGC: AI-Generated Content—text produced by Large Language Models

HGC: Human-Generated Content—original text written by humans

Source Bias: The tendency of a recommender system to rank items higher solely because they are generated by an AI model

Feedback Loop: A cycle where user interactions with model recommendations generate new training data, which updates the model, reinforcing its biases

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items

PBM: Position-Based Model—a click simulation model where the probability of clicking depends on both relevance and rank position

Model Collapse: Degradation in model performance caused by training on synthetic data generated by previous versions of models