Uncovering Cross-Domain Recommendation Ability of Large Language Models

📝 Paper Summary

Cross-Domain Recommendation (CDR), Zero-shot Recommendation

LLM4CDR leverages Large Language Models to transfer user preferences from source to target domains without training, using purchase history sequences and generated cross-domain feature guidance.

Core Problem

Traditional Cross-Domain Recommendation (CDR) systems struggle when the target domain holds no interaction history for the user (cold start), and they require extensive training data to align the two domains.

Why it matters:

Data sparsity and cold-start problems prevent effective recommendations in new or low-resource domains
Existing CDR methods usually require some overlapping user/item interaction data to train embeddings, which may not exist for new domains
The potential of LLMs to perform zero-shot knowledge transfer between disparate domains using only text history is underexplored

Concrete Example: A user has bought many 'CD & Vinyl' items but has never interacted with the 'Movies & TV' section. A standard recommender has no target-domain signal from which to suggest movies. LLM4CDR uses the CD history (e.g., 'Pink Floyd albums') to infer preferences (e.g., 'Musical Documentaries') and rank movie candidates.

Key Novelty

LLM4CDR Pipeline

Treats recommendation as a conditional text prediction task where source domain history acts as the context for target domain ranking
Utilizes a two-step prompting strategy where the LLM first generates 'recommendation guidance' (shared features like genre or theme) to bridge the domain gap before making predictions

Architecture

The LLM4CDR pipeline illustrating the prompt construction process.

Evaluation Highlights

+64.28% improvement in MAP@1 when using GPT-3.5 to transfer knowledge from 'CD & Vinyl' to 'Movies & TV' compared to using no source history
Performance gains depend heavily on domain proximity; using source history from an unrelated domain (large domain gap) degrades performance
Larger models (GPT-4) gain significantly more from cross-domain context than GPT-3.5 or local models served through Ollama

Breakthrough Assessment

7/10

A strong empirical study demonstrating LLMs' zero-shot transfer ability in recommendation. While the pipeline is prompt-based rather than architectural, the findings on domain gaps and 'guidance' prompts are valuable for practical deployment.

⚙️ Technical Details

Problem Definition

Setting: Cross-Domain Recommendation (CDR) as a conditional ranking task

Inputs: User purchase history H in source domain (sequence of item titles), Candidate list C in target domain (3 ground truth + negative samples)

Outputs: Ranked list of items from C based on purchase probability

Pipeline Flow

Information Preprocessing: Generate History & Candidate Lists
Recommendation Guidance Generation: LLM identifies shared features
Prompt Generation: Combine context, guidance, and task description
Recommendation Result Generation: LLM prediction and parsing
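
The four steps above can be sketched as two prompt builders and one driver function. This is a minimal illustration, not the paper's implementation: the prompt wording, the function names, and the `LLM` callable interface (prompt string in, completion string out) are all assumptions introduced here.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def build_guidance_prompt(source_domain: str, target_domain: str) -> str:
    """Guidance step: ask which features bridge the two domains (wording is illustrative)."""
    return (
        f"A user has a purchase history in '{source_domain}'. "
        f"Which shared features (for example genre or theme) should be considered "
        f"when recommending items in '{target_domain}'? Answer briefly."
    )


def build_ranking_prompt(
    history: List[str],
    candidates: List[str],
    guidance: str,
    source_domain: str,
    target_domain: str,
) -> str:
    """Prompt step: combine context (history), guidance, and the task description."""
    history_text = "\n".join(f"- {title}" for title in history)
    candidate_text = "\n".join(f"- {title}" for title in candidates)
    return (
        f"The user bought these '{source_domain}' items:\n{history_text}\n\n"
        f"Guidance on shared features:\n{guidance}\n\n"
        f"Candidate '{target_domain}' items:\n{candidate_text}\n\n"
        "Rank the candidates from most to least likely to be purchased. "
        "Return one title per line, copied exactly from the candidate list."
    )


def recommend(
    llm: LLM,
    history: List[str],
    candidates: List[str],
    source_domain: str,
    target_domain: str,
) -> str:
    """Run both LLM calls and return the raw ranking text (parsing happens separately)."""
    guidance = llm(build_guidance_prompt(source_domain, target_domain))
    prompt = build_ranking_prompt(history, candidates, guidance, source_domain, target_domain)
    return llm(prompt)
```

Keeping the guidance call separate from the ranking call mirrors the two-step structure described in the summary and makes it easy to ablate the guidance by passing an empty string.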

System Modules

History & Candidate Generator

Selects active users and formats their purchase history as text sequences; samples negative items for the candidate list

Model or implementation: Rule-based scripts
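
As a rough sketch of the candidate-list step, the function below mixes a user's ground-truth target items with randomly sampled negatives and shuffles the result. The function name, the uniform sampling strategy, and the default list size of 20 are assumptions made for illustration; the summary only states that the list contains 3 ground-truth items plus negative samples.

```python
import random
from typing import List, Optional, Sequence


def build_candidate_list(
    ground_truth: Sequence[str],
    catalog: Sequence[str],
    list_size: int = 20,
    seed: Optional[int] = None,
) -> List[str]:
    """Return ground-truth items plus uniformly sampled negatives, in shuffled order."""
    rng = random.Random(seed)
    positives = list(dict.fromkeys(ground_truth))  # de-duplicate, keep order
    excluded = set(positives)
    # Negatives come from catalog items that are not ground truth for this user.
    pool = [item for item in dict.fromkeys(catalog) if item not in excluded]
    n_negatives = list_size - len(positives)
    if n_negatives < 0 or n_negatives > len(pool):
        raise ValueError("list_size is incompatible with the ground truth or catalog size")
    candidates = positives + rng.sample(pool, n_negatives)
    rng.shuffle(candidates)  # avoid placing the positives at fixed positions
    return candidates
```

For example, `build_candidate_list(user_targets, all_movie_titles, list_size=20, seed=0)` would yield a reproducible 20-item list, where `user_targets` and `all_movie_titles` are placeholder names for data the caller supplies.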

Guidance Generator

Generates a textual description of common features (e.g., 'genre', 'theme') to consider between the specific source and target domains

Model or implementation: LLM (e.g., GPT-3.5)

Ranker

Ranks the candidate list by the user's predicted preference, conditioned on the source-domain history and the guidance

Model or implementation: GPT-3.5 / GPT-4 / Ollama (local model)

Output Parser

Extracts item titles from LLM text output and handles formatting errors

Model or implementation: Regex / String Matching
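
One way such a parser could work is sketched below: strip list markers with a regex, then map each line back to a known candidate with fuzzy string matching so that hallucinated titles are dropped. The function name, the 0.8 similarity cutoff, and the use of `difflib` are assumptions, not details taken from the paper.

```python
import re
from difflib import get_close_matches
from typing import List

# Leading list markers such as "1.", "2)", "-", "*", or a bullet character.
_PREFIX = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s*")


def parse_ranking(output: str, candidates: List[str], cutoff: float = 0.8) -> List[str]:
    """Map each output line to a known candidate title; skip lines that match nothing."""
    by_key = {title.casefold(): title for title in candidates}
    keys = list(by_key)
    ranked: List[str] = []
    for line in output.splitlines():
        text = _PREFIX.sub("", line).strip().strip("\"'")
        if not text:
            continue
        # Fuzzy matching tolerates small formatting differences in the LLM output.
        match = get_close_matches(text.casefold(), keys, n=1, cutoff=cutoff)
        if match:
            title = by_key[match[0]]
            if title not in ranked:  # keep only the first occurrence of each title
                ranked.append(title)
    return ranked
```

Items that the model omits are simply absent from the returned list; whether to append them at the end or re-prompt is a design choice this sketch leaves open.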

Novel Architectural Elements

Two-stage prompting pipeline where an explicit 'Guidance Generation' step creates a bridge of shared features (genre, theme) between domains before the actual ranking step

Modeling

Base Model: Evaluated on GPT-3.5, GPT-4, and 'Ollama' (specific local model version not detailed in text)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Embedding-based CDR: LLM4CDR requires no training or overlapping user history in the target domain, relying purely on semantic text transfer via LLMs
vs. Content-based CDR: LLM4CDR uses generative 'guidance' to dynamically identify shared features rather than relying on static tag correlations

Limitations

Dependency on domain proximity: Performance degrades significantly if source and target domains are not semantically related (e.g., Electronics to Movies)
Context window limits: Requires a limited candidate list (e.g., 20 items) because LLMs cannot process the full item catalog in the prompt
Parsing errors: LLMs occasionally hallucinate items or produce format mismatches requiring regex post-processing
Cost/Latency: Calling a large model (GPT-4) for every recommendation is more expensive and slower than embedding-based methods

📊 Experiments & Results

Evaluation Setup

Re-ranking task on Amazon Review Data subsets (Movies, Games, CD, Electronics)

Benchmarks:

Amazon Review Data (2018): Top-K recommendation, re-ranking setting

Metrics:

HIT@K (H@K)
MAP@K (Mean Average Precision)
NDCG@K (Normalized Discounted Cumulative Gain)

Statistical methodology: Mean of three repeated runs, reported with standard deviation
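
For reference, the three metrics can be computed per user as in the sketch below, assuming binary relevance (an item is either a ground-truth purchase or not). The normalization of average precision by `min(len(relevant), k)` is one common convention and is an assumption here; the paper may normalize differently. MAP@K is the mean of `ap_at_k` over all users.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ap_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Average precision at k for one user, with binary relevance."""
    hits = 0
    total = 0.0
    for pos, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / pos  # precision at this cut-off
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 1) for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

With a single ground-truth item at k=1, all three metrics reduce to the same 0/1 indicator, which is why differences between them only appear at larger k.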

Experiment Figures

Comparison of LLM4CDR performance (NDCG, MAP, HIT) across different history sequence lengths (20, 30, 40 items).

Impact of 'Recommendation Guidance' on performance metrics across different domain pairs.

Main Takeaways

Small domain gaps are essential: Transferring from 'CD & Vinyl' to 'Movies & TV' (same sub-group) yields large gains (+64.28% MAP@1 over using no source history), while transferring from 'Electronics' to 'Movies' (large gap) hurts performance.
Guidance matters: Explicitly prompting the LLM with 'recommendation guidance' (shared features like genre) improves metrics compared to using history alone.
History length trade-off: The shortest history tested (20 items) is insufficient for confident transfer; lengthening the history improves target-domain recommendation, but only up to a point.
Model scale correlation: Performance gains are positively correlated with model size; GPT-4 significantly outperforms GPT-3.5 and smaller local models (Ollama) in utilizing cross-domain context.

📚 Prerequisite Knowledge

Prerequisites

Basics of Recommender Systems (Cold-start problem)
Prompt Engineering (Context, Few-shot)
Ranking Metrics (MAP, NDCG)

Key Terms

CDR: Cross-Domain Recommendation—techniques to improve recommendations in a target domain using data from a source domain

Cold-start problem: The difficulty of recommending items to users who have no prior interaction history in the system

MAP@K: Mean Average Precision at K—a metric that measures the quality of the ranking by considering the position of relevant items in the top K results

NDCG@K: Normalized Discounted Cumulative Gain at K—a ranking metric that gives higher importance to correct items appearing earlier in the list

Bootstrapping: Used here to mean shuffling the order of the candidate list to reduce position bias in LLM predictions

Zero-shot: The ability of a model to perform a task (here, recommendation in a new domain) without having been explicitly trained on examples of that specific task