Echoes in the Loop: Diagnosing Risks in LLM-Powered Recommender Systems under Feedback Loops

📝 Paper Summary

LLM-Powered Recommender Systems (LLM4RS)
AI Safety and Risk Assessment
Feedback Loop Dynamics

EchoTrace is a diagnostic framework that reveals how LLM-induced risks like hallucination and bias accumulate and amplify through feedback loops in recommender systems, causing long-term ecosystem polarization.

Core Problem

Large Language Models (LLMs) embedded in recommender systems introduce hallucinations and biases that are not just static errors but propagate and intensify when model outputs are fed back as training data.

Why it matters:

LLMs may overemphasize popularity signals or fabricate user attributes (e.g., non-existent occupations), grounding recommendations in false premises
Feedback loops in recommender systems naturally reinforce existing patterns, but LLM-specific artifacts (hallucinations) create qualitatively different risks like artificial polarization
Current evaluations focus on short-term accuracy, ignoring how LLM-generated errors accumulate over repeated recommendation cycles to distort the entire content ecosystem

Concrete Example: An LLM-based profile generator infers a user is a 'film critic' (a hallucinated attribute not in the data) because they watched many movies. This fabrication is fed back into the system, causing the recommender to suggest items for critics, reinforcing the false attribute and diverging from the user's actual preferences.

Key Novelty

EchoTrace: Role-Aware, Phase-Wise Risk Diagnosis

Decomposes risk analysis into three phases: Content Generation (inspecting intermediate LLM outputs), Recommendation (inspecting ranked lists), and Feedback Loop (inspecting long-term evolution)
Simulates a closed-loop environment where LLM-influenced recommendations are strictly re-injected as training data to measure how risks accumulate over time (e.g., representation drift)

Architecture

The EchoTrace diagnostic framework pipeline, illustrating the three distinct phases of risk analysis in a feedback loop.

Evaluation Highlights

Hallucination (FEF) rates in generated user profiles reach 93.16% for 'Occupation' on MovieLens-1M, creating widespread spurious user signals
Feedback loops increase ecosystem polarization: the distance between user embedding clusters grows from 3.73 to 9.29 over 5 periods in A-LLMRec
Gender bias amplifies over time: the share of the dominant group (Male) in MovieLens-1M increases from 85.90% to 86.80% due to feedback dynamics

Breakthrough Assessment

8/10

Provides a crucial methodological shift from single-step accuracy to long-term ecosystem health in LLM-based recommendation. The finding that LLMs induce structural polarization distinct from traditional algorithms is significant.

⚙️ Technical Details

Problem Definition

Setting: Controlled feedback-loop simulation where a recommender system is repeatedly retrained on its own outputs mixed with user interaction history

Inputs: User-Item interaction log D, temporal cutoff t, number of feedback periods N

Outputs: Longitudinal measurements of Bias, Hallucination (FEF/LC), and Polarization (embedding distance)

Pipeline Flow

Initial Split (partition data into D_train and D_gt)
Cycle Start: Content Generation Phase (LLM creates profiles/signals)
Recommendation Phase (System ranks items for active users)
Feedback Loop Phase (Top-K items injected into D_train)
Retraining (System updates on D_train + Injected items) -> Repeat
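The cycle above can be sketched as a small closed-loop simulation. This is a minimal illustration and not the paper's implementation: `run_feedback_loop`, `popularity_recommender`, and the dictionary layout are hypothetical names chosen here, and the `recommend` callable stands in for the whole content-generation, retraining, and ranking stack. The sketch assumes that every recommended item counts as consumed and that K equals the user's interaction count in D_gt.

```python
from collections import Counter
from typing import Callable, Dict, List, Set

Interactions = Dict[str, Set[str]]  # user id -> set of consumed item ids
Recommender = Callable[[Interactions, str, int], List[str]]


def run_feedback_loop(
    d_train: Interactions,
    d_gt: Interactions,
    recommend: Recommender,
    n_periods: int = 5,
) -> List[Interactions]:
    """Return a snapshot of the training data after each feedback period."""
    current = {user: set(items) for user, items in d_train.items()}
    snapshots: List[Interactions] = []
    for _ in range(n_periods):
        # Recommendation phase: every user is ranked against the same data,
        # which plays the role of the model retrained at the period start.
        injected: Dict[str, List[str]] = {}
        for user in current:
            k = len(d_gt.get(user, ()))  # dynamic top-k (assumption)
            if k > 0:
                injected[user] = recommend(current, user, k)
        # Feedback phase: deterministically inject the top-k items.
        for user, items in injected.items():
            current[user].update(items)
        snapshots.append({user: set(items) for user, items in current.items()})
    return snapshots


def popularity_recommender(train: Interactions, user: str, k: int) -> List[str]:
    """Toy stand-in recommender: most popular unseen items, ties by item id."""
    counts = Counter(item for items in train.values() for item in items)
    unseen = [item for item in counts if item not in train[user]]
    unseen.sort(key=lambda item: (-counts[item], item))
    return unseen[:k]


# Toy data, invented for illustration only.
d_train = {"u1": {"a", "b"}, "u2": {"b", "c"}, "u3": {"b"}}
d_gt = {"u1": {"x"}, "u2": {"y"}, "u3": {"y", "z"}}
snapshots = run_feedback_loop(d_train, d_gt, popularity_recommender, n_periods=2)
```

Keeping one snapshot per period is what makes the longitudinal measurements possible: each metric can be computed on every snapshot and compared across periods.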

System Modules

Content Generator

Generate intermediate signals (profiles, augmented interactions) from raw history

Model or implementation: Depends on baseline (e.g., gpt-4o-2024-08-06)

Ranking Engine

Produce final Top-K items for users

Model or implementation: Baseline RS (e.g., LightGCN) or LLM (e.g., facebook/opt-6.7b)

Feedback Simulator

Inject recommendations back into training data to simulate user consumption

Model or implementation: Deterministic Injection Rule

Novel Architectural Elements

Three-phase diagnostic pipeline specifically designed to trace risk propagation from content generation to ecosystem-level polarization
Controlled feedback injection mechanism that isolates LLM-induced drift from natural user behavior evolution

Modeling

Base Model: Varies by baseline role: gpt-4o-2024-08-06 (Augmenter/Representer), facebook/opt-6.7b (Recommender)

Training Method: Evaluation Framework (Paper applies framework to existing models)

Training Data:

MovieLens-1M (split at 80% temporal)
Amazon-Books (split at 50% temporal)

Key Hyperparameters:

feedback_periods_N: 5
top_k: Dynamic (matches user's actual activity volume in ground truth)
temporal_split_t: 0.8 (ML-1M) / 0.5 (A-Books)
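As a rough illustration of the `temporal_split_t` and `top_k` settings above, the sketch below splits an interaction log chronologically and derives a per-user K from the held-out part. It assumes a global split by interaction count after sorting on timestamp; the paper may define the cutoff differently (for example per user or by wall-clock time). The function names and the `(user, item, timestamp)` tuple layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interaction = Tuple[str, str, int]  # (user id, item id, timestamp)


def temporal_split(
    log: List[Interaction], train_fraction: float
) -> Tuple[List[Interaction], List[Interaction]]:
    """Split a log into (D_train, D_gt) at a chronological cutoff."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    ordered = sorted(log, key=lambda row: row[2])  # oldest first
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def dynamic_top_k(d_gt: List[Interaction]) -> Dict[str, int]:
    """Per-user K: the number of interactions the user has in D_gt."""
    return dict(Counter(user for user, _, _ in d_gt))


# Toy log with ten interactions, deliberately out of order.
log = [("u2", "i9", 9), ("u2", "i8", 8)] + [("u1", f"i{t}", t) for t in range(8)]
d_train, d_gt = temporal_split(log, train_fraction=0.8)
k_per_user = dynamic_top_k(d_gt)
```

Tying K to each user's held-out activity keeps the injected volume comparable to what the user actually consumed, so the loop does not inflate or shrink activity levels by construction.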

Compute: Single NVIDIA RTX 6000 Ada Generation GPU

Comparison to Prior Work

vs. LightGCN: LightGCN shows minimal polarization over time; LLM-powered methods show significant embedding divergence
vs. Traditional Feedback Loop studies: Prior work focuses on popularity bias in CF; this work isolates LLM-specific hallucinations (e.g. fake attributes) as a driver of ecosystem drift
vs. GenRec [not cited in paper]: Generative RS work typically focuses on accuracy; EchoTrace focuses on the longitudinal safety risks of the generative process itself

Limitations

Relies on simulated feedback (assuming users consume recommendations) rather than live user experiments
High cost of API-based LLMs (GPT-4o) limits the scale of feedback loop iterations (N=5)
Focuses on text-based LLM roles; does not cover multi-modal LLMs in RS

Reproducibility

Code: https://github.com/DongUk-Park/EchoTrace

Publicly available code (GitHub). Uses closed-source LLM (GPT-4o) for some baselines, incurring API costs ($15.6-$122.4 per period). Open-source baselines use OPT-6.7b.

📊 Experiments & Results

Evaluation Setup

Longitudinal simulation of recommendation cycles over 5 distinct time periods

Benchmarks:

MovieLens-1M (ML-1M) (Movie Recommendation)
Amazon-Books (A-Books) (Book Recommendation)

Metrics:

FEF Rate (Factual Errors and Fabrications)
LC Rate (Logical Contradictions)
Popularity Gap (Difference between rec popularity and ground truth)
Cluster Distance (Euclidean distance between embedding centroids)
Statistical methodology: Not explicitly reported in the paper
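The first three metrics listed above can be approximated with a few lines of code. This is a hedged sketch based only on the one-line definitions in this summary: the exact counting rules in the paper are not given here, so the function names, the percentage scaling, the "any disagreement across runs" rule for LC, and the sign convention for the gap (recommendations minus ground truth) are all assumptions.

```python
from statistics import fmean
from typing import Dict, List, Sequence, Set


def fef_rate(generated: Sequence[str], valid: Set[str]) -> float:
    """Percentage of generated values (attributes or items) not in `valid`."""
    if not generated:
        return 0.0
    fabricated = sum(1 for value in generated if value not in valid)
    return 100.0 * fabricated / len(generated)


def lc_rate(runs: Dict[str, List[str]]) -> float:
    """Percentage of inputs whose repeated runs disagree with each other.

    `runs` maps an input id to the outputs of repeated identical calls.
    """
    if not runs:
        return 0.0
    inconsistent = sum(1 for outputs in runs.values() if len(set(outputs)) > 1)
    return 100.0 * inconsistent / len(runs)


def popularity_gap(
    recommended: Sequence[str],
    consumed: Sequence[str],
    popularity: Dict[str, int],
) -> float:
    """Mean popularity of recommendations minus mean popularity of consumption."""
    if not recommended or not consumed:
        raise ValueError("both item lists must be non-empty")
    rec_mean = fmean([popularity.get(item, 0) for item in recommended])
    gt_mean = fmean([popularity.get(item, 0) for item in consumed])
    return rec_mean - gt_mean


# Toy inputs, invented for illustration only.
item_fef = fef_rate(["a", "b", "zz", "a"], valid={"a", "b", "c"})
profile_lc = lc_rate({"u1": ["x", "x"], "u2": ["x", "y"],
                      "u3": ["z", "z", "z"], "u4": ["a", "b"]})
gap = popularity_gap(["a", "c"], ["b", "c"], {"a": 10, "b": 2, "c": 4})
```

A positive gap under this convention means the recommender leans toward more popular items than the user actually consumed.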

Key Results

Phase 1 (Content Generation) results reveal high rates of hallucination in LLM-generated profiles, creating a distorted foundation for downstream learning.
Benchmark	Metric	Baseline	This Paper	Δ
ML-1M	FEF Rate, Occupation (%)	0	93.16	+93.16
ML-1M	FEF Rate, Age (%)	0	73.68	+73.68

Phase 2 (Recommendation) results show how decision-making roles introduce invalid items and instability.
Benchmark	Metric	Baseline	This Paper	Δ
A-Books	FEF Rate, Items (%)	0	7.40	+7.40
ML-1M	FEF Rate, Items (%)	0	4.07	+4.07

Phase 3 (Feedback Loop) results demonstrate the long-term accumulation of risks, specifically ecosystem polarization and bias amplification. Here the Baseline and This Paper columns hold the values at the start and end of the 5-period loop.
Benchmark	Metric	Baseline	This Paper	Δ
ML-1M	Cluster Distance, Users	3.73	9.29	+5.56
ML-1M	Cluster Distance, Items	1.09	2.09	+1.00
ML-1M	Gender Bias, Male Share (%)	85.90	86.80	+0.90

Experiment Figures

T-SNE visualization of user and item embeddings at the start (Period 1) and end (Period 5) of the feedback loop for A-LLMRec vs. LightGCN.

Main Takeaways

LLMs frequently hallucinate user attributes (e.g., assuming high-activity users are 'critics'), creating artificial signals that the recommender system learns to reinforce.
While traditional RS (LightGCN) maintains stable embedding structures, LLM-powered RS causes 'polarization,' where user/item representations drift into separated clusters over time.
Open-ended LLM recommenders (LLM-as-Recommender) superficially reduce popularity bias, but they do so by injecting hallucinated (invalid) items, trading reliability for diversity.
Risks are not static; minor biases in the initial LLM generation phase accumulate into significant structural shifts in the ecosystem after multiple feedback loops.

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems (Collaborative Filtering)
Feedback Loops in Machine Learning
LLM Hallucination types (Factuality vs. Consistency)

Key Terms

LLM4RS: LLM-Powered Recommender Systems—systems integrating LLMs for data augmentation, profiling, or direct recommendation

FEF: Factual Errors and Fabrications—a metric measuring the proportion of generated attributes or items that do not exist in the ground truth or system candidate set

LC: Logical Contradictions—a metric measuring inconsistency where repeated executions under identical inputs yield different outputs

LLMGC: LLM-Generated Content—intermediate signals like user profiles, item descriptions, or synthetic interactions created by the LLM

LLM-as-Augmenter: Using an LLM to generate synthetic interaction data to enrich sparse training sets

LLM-as-Representer: Using an LLM to summarize interaction history into user/item profiles or embeddings

LLM-as-Recommender: Using an LLM to directly generate a ranked list of items for a user

Popularity Gap: The difference in average item popularity between the recommendation list and the user's actual consumption history

Polarization: The phenomenon where user or item representations in the embedding space drift apart into distinct, separated clusters over time
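Polarization as defined here can be tracked with the Cluster Distance metric, the Euclidean distance between embedding centroids. The sketch below assumes that embeddings have already been assigned to exactly two clusters (for example by a clustering step that is not shown); how the paper forms its clusters is not stated in this summary, and the function names are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot take the centroid of an empty cluster")
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


def cluster_distance(
    embeddings: Sequence[Sequence[float]], labels: Sequence[Hashable]
) -> float:
    """Euclidean distance between the centroids of two labelled clusters."""
    groups: Dict[Hashable, List[Sequence[float]]] = {}
    for vec, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(vec)
    if len(groups) != 2:
        raise ValueError("this sketch expects exactly two clusters")
    first, second = (centroid(group) for group in groups.values())
    return math.dist(first, second)


# Toy 2-D embeddings, invented for illustration only.
embeddings = [(0.0, 0.0), (2.0, 0.0), (4.0, 4.0), (4.0, 4.0)]
labels = [0, 0, 1, 1]
distance = cluster_distance(embeddings, labels)
```

Computing this value on the embeddings from each feedback period and comparing the first period with the last gives a growth figure of the kind reported in the results (for example 3.73 to 9.29 for users).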