Chat-REC: Towards Interactive and Explainable LLMs-Augmented Recommender System

📝 Paper Summary

LLM-augmented Recommender Systems Conversational Recommendation

Chat-REC augments traditional recommender systems by converting user profiles and history into prompts for Large Language Models (LLMs), enabling interactive dialogue, cross-domain transfer, and cold-start handling via in-context learning.

Core Problem

Traditional recommender systems suffer from poor interactivity and explainability, struggle with cold-start scenarios for new items, and have difficulty transferring preferences across domains.

Why it matters:

Current systems lack natural feedback mechanisms, making it hard for users to refine requests or understand why an item was suggested.
Static candidate generation often fails to capture dynamic user intent or leverage broader world knowledge about new or cross-domain items.
Manual information searching is infeasible in big data eras, but automated systems often feel like black boxes.

Concrete Example: A user asks for action movies. A traditional system just lists titles. Chat-REC provides a list, but if the user then asks 'Why Fargo?', it explains based on user history. If the user asks for non-movie recommendations (e.g., books) based on those movie preferences, traditional systems fail, but Chat-REC suggests books or games.

Key Novelty

In-Context Learning for Candidate Refinement

Instead of training the LLM, the system converts user history and profiles into text prompts.
A traditional recommender generates a candidate set, which the LLM then re-ranks, filters, or explains based on the prompt context.
The LLM acts as an interactive interface, allowing multi-turn refinement and cross-domain reasoning without parameter updates.

Architecture

Overview of Chat-Rec framework linking user queries to a recommender system via an LLM interface.

Evaluation Highlights

Chat-Rec (text-davinci-003) achieves 0.3802 NDCG on MovieLens 100K top-5 recommendation, outperforming LightGCN by +11.01%.
In zero-shot rating prediction, Chat-Rec (text-davinci-003) reaches an RMSE of 0.785, improving over Item-KNN (0.933) by ~15.8%.
Ablation shows that removing the traditional recommender's top-1 item from the prompt background drops NDCG performance by ~19%, proving the value of injecting recommender priors.

Breakthrough Assessment

7/10

Offers a practical, training-free paradigm for combining classic recommenders with LLMs. While methodologically simple (prompt engineering), the empirical gains and multi-scenario flexibility (cold start, cross-domain) are significant.

⚙️ Technical Details

Problem Definition

Setting: Conversational recommendation where the system must generate items, explanations, or responses based on dialogue history and user profiles.

Inputs: User-item history interactions, User profile, User query Q_i, Dialogue history H_<i

Outputs: Response R (which may include a list of recommended items or natural language explanation)

Pipeline Flow

User Interface (receives query)
Prompt Constructor (aggregates history/profile)
Traditional Recommender (generates candidate set)
LLM Interface (Ranking/Explanation)

System Modules

Prompt Constructor

Synthesize user profile, interaction history, and current query into a natural language prompt

Model or implementation: Rule-based template

Traditional Recommender

Retrieve a coarse candidate set of items to narrow the search space for the LLM

Model or implementation: Any standard recommender (e.g., LightFM, Matrix Factorization)

LLM Inference

Reason over the candidate set to re-rank items, select top-k, or generate explanations

Model or implementation: GPT-3.5 series (text-davinci-003, gpt-3.5-turbo)

Novel Architectural Elements

Two-step filtering where a traditional recommender acts as a 'candidate generator' and the LLM acts as a 'reasoning ranker' via prompt injection
Injection of external knowledge embeddings for new items to handle cold-start within the prompt context

Modeling

Base Model: text-davinci-003, text-davinci-002, gpt-3.5-turbo

📊 Experiments & Results

Evaluation Setup

Top-k recommendation and Rating Prediction on MovieLens 100K

Benchmarks:

MovieLens 100K (Movie Recommendation / Rating Prediction)

Metrics:

Precision
Recall
NDCG
RMSE
MAE
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Top-5 Recommendation Results: Chat-Rec variants generally outperform baselines in Precision and NDCG, with text-davinci-003 performing best.
MovieLens 100K (Top-5)	Precision	0.3030	0.3240	+0.0210
MovieLens 100K (Top-5)	NDCG	0.3425	0.3802	+0.0377
MovieLens 100K (Top-5)	Recall	0.1455	0.1404	-0.0051
Rating Prediction Results: Chat-Rec significantly outperforms baselines in predicting explicit ratings.
MovieLens 100K (Rating Prediction)	RMSE	0.933	0.785	-0.148
MovieLens 100K (Rating Prediction)	MAE	0.734	0.593	-0.141
Ablation on Prompt Construction: Removing the recommender system's top-1 item from the prompt background severely hurts performance.
MovieLens 100K (Top-5)	NDCG	0.3802	0.3055	-0.0747

Main Takeaways

In-Context Learning is effective for recommendation: LLMs can re-rank candidates effectively without fine-tuning, surpassing trained baselines like LightGCN in NDCG/Precision.
Traditional Recommenders are still vital: The LLM performs significantly worse if not primed with a high-quality candidate set or top-item context from a standard recommender.
Cross-domain transfer: Qualitative case studies show the system can successfully recommend books or games based on movie history, a capability hard for standard matrix factorization.
Temperature matters: Performance peaks around temperature 0.9, suggesting some randomness helps in exploring the solution space for ranking.

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering
In-Context Learning (ICL)
Prompt Engineering
Cold-start problem in Recommender Systems

Key Terms

In-Context Learning (ICL): A paradigm where LLMs perform tasks by conditioning on input examples or instructions in the prompt without updating model parameters

Cold-start: The challenge of recommending items to new users or recommending new items with no prior interaction history

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items in the recommendation list

Zero-shot: Performing a task (like rating prediction) without having explicitly trained on examples of that specific task

RMSE: Root Mean Squared Error—a standard metric for measuring the differences between predicted values and observed values

AIGC: AI Generated Content