Keyword-driven Retrieval-Augmented Large Language Models for Cold-start User Recommendations

📝 Paper Summary

Recommender Systems, Cold-start Recommendation

KALM4Rec addresses cold-start recommendation by asking new users for keywords, retrieving candidates via a keyword-item graph, and re-ranking them using LLMs prompted with keyword profiles.

Core Problem

Traditional collaborative filtering fails for new (cold-start) users due to lack of interaction history, while LLMs struggle with token limits and hallucinations when processing full review text.

Why it matters:

Cold-start users are critical for platform growth but difficult to retain without personalized suggestions
Directly feeding user/item history into LLMs is token-expensive and prone to exceeding context windows
User reviews contain noise; extracting keywords captures preference essence more efficiently than full text

Concrete Example: A new user joins Yelp with no interaction history. Standard collaborative filtering (CF) cannot recommend anything personalized. Asking them for keywords like 'sushi' and 'quiet' allows KALM4Rec to retrieve relevant spots, whereas a standard LLM prompted with 'recommend a restaurant' might hallucinate non-existent places.

Key Novelty

Keyword-driven Retrieval-Augmented Large Language Models for Cold-start User Recommendations (KALM4Rec)

Uses explicit keyword sets (e.g., 'sushi', 'quiet') instead of full reviews or dense vectors to represent user preferences and item characteristics, reducing token usage
Retrieves candidates via Message Passing on Graph (MPG) over a graph connecting keywords to items, without any trainable deep learning parameters
Re-ranks items using LLMs prompted with these concise keyword profiles and few-shot examples

Architecture

The KALM4Rec framework showing the two-stage process: Candidate Retrieval via graph and Candidate Re-ranking via LLM.

Evaluation Highlights

Outperforms retrieval baselines (CLCRec, MVAE) on Yelp and TripAdvisor datasets in Recall@20 and Precision@20
LLM re-ranking (Gemini Pro 1.5) improves Recall@3 over retrieval-only methods, showing LLMs effectively refine keyword-based candidate lists
3-shot prompting consistently outperforms zero-shot and 1-shot strategies for the re-ranking task

Breakthrough Assessment

5/10

A practical application of LLMs to cold-start recommendation. The use of keywords to bridge user intent and LLM context is clever, but the graph method is relatively simple and the scale is moderate.

⚙️ Technical Details

Problem Definition

Setting: Cold-start user recommendation using keyword queries

Inputs: A set of keywords k_{u_c} provided by a new (cold-start) user u_c

Outputs: A ranked list of relevant items R_{u_c}

Pipeline Flow

Keyword Extraction (extracts nouns/adjectives from reviews)
Graph Construction (builds Keyword-Item graph)
Candidate Retrieval (MPG propagates user keywords to find items)
LLM Re-ranking (sorts top candidates using prompts)

System Modules

Keyword Extractor

Extract meaningful terms from raw reviews

Model or implementation: SpaCy (POS tagging)
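
As a rough illustration of POS-based keyword filtering, the sketch below keeps nouns and adjectives from tokens that are already tagged. The `(text, pos)` pairs stand in for what a tagger such as spaCy exposes as `token.text` and `token.pos_`; the function name, the lowercasing step, and the hand-tagged sample review are assumptions for illustration, not the paper's exact configuration.

```python
from typing import Iterable, List, Tuple

# Universal POS tags treated as keyword-bearing (assumption: nouns and adjectives only).
KEEP_POS = {"NOUN", "ADJ"}


def extract_keywords(tagged_tokens: Iterable[Tuple[str, str]]) -> List[str]:
    """Return unique lowercase nouns/adjectives in first-seen order."""
    seen = set()
    keywords = []
    for text, pos in tagged_tokens:
        word = text.lower()
        if pos in KEEP_POS and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


# Hypothetical, hand-tagged review: "The sushi was fresh and the room quiet"
review = [
    ("The", "DET"), ("sushi", "NOUN"), ("was", "AUX"), ("fresh", "ADJ"),
    ("and", "CCONJ"), ("the", "DET"), ("room", "NOUN"), ("quiet", "ADJ"),
]
keywords = extract_keywords(review)
print(keywords)
```

In a real pipeline the tagged pairs would come from running a POS tagger over each review, and the resulting keywords would be aggregated per item and per user.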

MPG Retriever

Retrieve items relevant to user keywords via graph propagation

Model or implementation: Message Passing on Graph (unsupervised, weighted sum)
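
The weighted-sum retrieval described above can be sketched as follows: each keyword supplied by the user passes its edge weight to every item it is connected to, and an item's score is the sum of the messages it receives. The toy graph, its weights, and the function name are hypothetical, and the paper's actual propagation rule may include further steps; this is only a minimal, parameter-free version.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical keyword -> {item: edge weight} graph; weights stand in for TF-IRF scores.
GRAPH: Dict[str, Dict[str, float]] = {
    "sushi": {"Sakura": 0.75, "Bento Box": 0.5},
    "quiet": {"Sakura": 0.5, "Tea House": 0.75},
    "pizza": {"Luigi's": 1.0},
}


def retrieve(user_keywords: List[str],
             graph: Dict[str, Dict[str, float]],
             k: int) -> List[Tuple[str, float]]:
    """Score items by summing edge weights from the user's keywords; return the top k."""
    scores: Dict[str, float] = defaultdict(float)
    for kw in user_keywords:
        # Keywords missing from the graph contribute nothing.
        for item, weight in graph.get(kw, {}).items():
            scores[item] += weight
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:k]


candidates = retrieve(["sushi", "quiet"], GRAPH, k=2)
print(candidates)
```

Because nothing is learned, adding a new item only requires adding its keyword edges to the graph.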

LLM Ranker

Re-rank the retrieved candidates based on user keywords

Model or implementation: Gemini Pro 1.5, GPT-3.5-Turbo, Mistral 8B, or Llama 3-8B
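
One way to picture the re-ranking input is a prompt that lists the user's keywords, each candidate's keyword profile, and a few solved examples. The sketch below only builds that prompt string; the wording, function names, and sample data are hypothetical, and no LLM client is called because the summary does not specify the API usage.

```python
from typing import Dict, List, Tuple

# (user keywords, {candidate: keywords}, reference ranking) for one solved example.
Shot = Tuple[List[str], Dict[str, List[str]], List[str]]


def format_case(user_kws: List[str], candidates: Dict[str, List[str]]) -> str:
    """Render one user and their candidates as compact keyword profiles."""
    lines = ["User keywords: " + ", ".join(user_kws), "Candidates:"]
    for name, kws in candidates.items():
        lines.append(f"- {name}: " + ", ".join(kws))
    return "\n".join(lines)


def build_prompt(user_kws: List[str],
                 candidates: Dict[str, List[str]],
                 shots: List[Shot]) -> str:
    """Prepend solved examples (few-shot), then leave the final ranking open."""
    parts = ["Rank the candidates from most to least relevant to the user's keywords."]
    for ex_kws, ex_cands, ex_order in shots:
        parts.append(format_case(ex_kws, ex_cands) + "\nRanking: " + ", ".join(ex_order))
    parts.append(format_case(user_kws, candidates) + "\nRanking:")
    return "\n\n".join(parts)


# Hypothetical 1-shot example followed by the query to be ranked.
shots = [(
    ["pizza", "family"],
    {"Luigi's": ["pizza", "kids", "loud"], "Tea House": ["tea", "quiet"]},
    ["Luigi's", "Tea House"],
)]
prompt = build_prompt(
    ["sushi", "quiet"],
    {"Sakura": ["sushi", "quiet", "fresh"], "Bento Box": ["sushi", "busy"]},
    shots,
)
print(prompt)
```

Passing three entries in `shots` would correspond to the 3-shot setting, and an empty list to zero-shot.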

Novel Architectural Elements

Keyword-Item heterogeneous graph for retrieval where edge weights are statically calculated via TF-IRF (no learned embeddings)
Two-stage pipeline explicitly bridging sparse keyword inputs to LLM reasoning via retrieval
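
To make the static edge weights concrete, here is a minimal sketch that computes TF-IDF-style weights with each item treated as one "document". The exact TF-IRF formula is not given in this summary, so the normalization and the natural logarithm below are assumptions by analogy with TF-IDF; the function name and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_irf(item_keywords: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Return keyword -> {item: weight} using tf * log(N / item frequency)."""
    n_items = len(item_keywords)
    # Number of items in which each keyword appears at least once.
    item_freq: Counter = Counter()
    for kws in item_keywords.values():
        item_freq.update(set(kws))

    weights: Dict[str, Dict[str, float]] = {}
    for item, kws in item_keywords.items():
        counts = Counter(kws)
        total = len(kws)
        for kw, count in counts.items():
            tf = count / total                        # share of the item's keywords
            irf = math.log(n_items / item_freq[kw])   # rarer across items -> larger
            weights.setdefault(kw, {})[item] = tf * irf
    return weights


# Hypothetical keywords extracted from each item's reviews.
items = {
    "Sakura": ["sushi", "sushi", "quiet", "food"],
    "Luigi's": ["pizza", "food"],
    "Tea House": ["quiet", "tea", "food"],
}
weights = tf_irf(items)
print(weights["sushi"], weights["food"])
```

In this toy data "food" appears for every item, so its weight is zero everywhere, while "sushi" is distinctive for a single item and receives the largest weight.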

Modeling

Base Model: Gemini Pro 1.5 (best performing LLM ranker)

Training Method: Inference-only with In-Context Learning (Few-shot)

Compute: Inference on Colab Pro with L4 GPU; Retrieval is parameter-free graph traversal

Comparison to Prior Work

vs. CLCRec: KALM4Rec uses explicit graph propagation (MPG) instead of contrastive learning embeddings
vs. LightGCN: KALM4Rec uses static TF-IRF weights without training parameters, whereas LightGCN learns embeddings
vs. Standard RAG [not cited in paper]: KALM4Rec uses structured keywords rather than raw text chunks as retrieval context, reducing noise and token cost

Limitations

Relies on users explicitly providing high-quality keywords
Retrieval method (MPG) is heuristic-based (TF-IRF) and not learned, potentially missing complex latent interactions
Evaluation limited to Top-K retrieval/ranking metrics; user satisfaction with keywords not tested
Performance depends heavily on the underlying LLM capability (Gemini > Llama 3 in experiments)

Reproducibility

Code: https://github.com/dangkh/Kalm4rec-www

Uses public Yelp and TripAdvisor datasets. Keyword extraction uses standard SpaCy. Retrieval uses static graph logic.

📊 Experiments & Results

Evaluation Setup

Cold-start recommendation where test users have no training history; they query using keywords.

Benchmarks:

Yelp.com dataset (Restaurant recommendation)
TripAdvisor dataset (Hotel recommendation)

Metrics:

Recall (R@K)
Precision (P@K)
NDCG (N@K)
Statistical methodology: Not explicitly reported in the paper
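
For reference, the three ranking metrics listed above can be computed as in the sketch below. It uses the common binary-relevance definitions; whether the paper uses exactly these variants (for example, the NDCG normalization) is an assumption, and the sample ranking is made up.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant items."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical recommendation list and ground truth for one test user.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "d", "e"}
print(recall_at_k(ranked, relevant, 3),
      precision_at_k(ranked, relevant, 3),
      ndcg_at_k(ranked, relevant, 3))
```

Reported scores are typically the average of these per-user values over all test users.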

Key Results

Retrieval performance comparisons show the proposed MPG method outperforms baselines on Yelp data.

Benchmark	Metric	Baseline	This Paper	Δ
Yelp (Philadelphia)	Recall@20	0.0543	0.0664	+0.0121
Yelp (Philadelphia)	Precision@20	0.0033	0.0039	+0.0006

Re-ranking experiments demonstrate that adding an LLM ranker improves over pure retrieval.

Benchmark	Metric	Baseline	This Paper	Δ
Yelp (Philadelphia)	Recall@3	0.0124	0.0381	+0.0257
Yelp (Philadelphia)	Precision@3	0.0051	0.0152	+0.0101
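
To put the absolute deltas in proportion, the snippet below derives relative improvements from the reported figures. The baseline and KALM4Rec values are copied from the rows above; the percentages are simple arithmetic on those values and are not numbers reported in this summary.

```python
# (metric, baseline, this paper) copied from the Key Results rows for Yelp (Philadelphia).
rows = [
    ("Recall@20", 0.0543, 0.0664),
    ("Precision@20", 0.0033, 0.0039),
    ("Recall@3", 0.0124, 0.0381),
    ("Precision@3", 0.0051, 0.0152),
]


def relative_gain(baseline: float, ours: float) -> float:
    """Improvement over the baseline, expressed as a percentage of the baseline."""
    return (ours - baseline) / baseline * 100.0


gains = {metric: relative_gain(base, ours) for metric, base, ours in rows}
for metric, pct in gains.items():
    print(f"{metric}: +{pct:.1f}%")
```

On these figures the retrieval-stage gains are on the order of 20%, while the re-ranking stage roughly triples Recall@3 and Precision@3, though all absolute values remain small.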

Experiment Figures

Performance analysis of different LLMs, prompt strategies, and input formats.

Main Takeaways

Graph-based keyword propagation (MPG) effectively retrieves candidates without training deep learning models
LLM re-ranking substantially improves top-K precision and recall over raw retrieval scores (e.g., Recall@3 rises from 0.0124 to 0.0381 on Yelp Philadelphia)
Using keywords reduces context length requirements compared to full review text, enabling efficient LLM usage
Few-shot prompting (specifically 3-shot) guides the LLM ranker better than zero-shot or 1-shot prompting

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Recommender Systems (Collaborative Filtering)
Knowledge of Large Language Models (LLMs) and prompting strategies
Graph-based retrieval concepts

Key Terms

Cold-start user: A new user with no prior interaction history (clicks, ratings) on the platform, making personalized recommendation difficult

TF-IRF: Term Frequency-Inverse Review Frequency—a scoring scheme adapted from TF-IDF to weigh the importance of a keyword to a specific item relative to its frequency across all items

MPG: Message Passing on Graph—the retrieval method used here, where preference signals propagate from user keywords to item nodes in a heterogeneous graph

sBERT: Sentence-BERT—a modification of the BERT network that uses siamese networks to derive semantically meaningful sentence embeddings

LLM re-ranking: Using a Large Language Model to sort a small list of retrieved candidates by relevance, often using reasoning capabilities