Cross-Domain Recommendation Meets Large Language Models

📝 Paper Summary

Cross-Domain Recommendation (CDR) Cold-Start Recommendation

The paper investigates the reasoning capabilities of LLMs for cross-domain recommendation by designing specific prompts that outperform specialized neural baselines in rating and ranking tasks, particularly using only source domain data.

Core Problem

Single-domain recommender systems fail on cold-start users, and existing cross-domain solutions rely on complex architectures that struggle with data scarcity and lack general reasoning capabilities.

Why it matters:

Existing CDR models often memorize patterns rather than reasoning, limiting generalization beyond training data
Data-scarce scenarios make it difficult to train complex neural networks effectively
Current systems cannot easily incorporate rich textual/contextual information which LLMs handle naturally

Concrete Example: A user has rated many Books (source) but no Movies (target). A traditional CDR model needs complex mapping training to predict movie ratings. An LLM can simply read the book history and infer movie preferences via natural language reasoning without task-specific training.

Key Novelty

LLM-based Cross-Domain Prompting Framework

Replaces complex neural mapping architectures with natural language prompts that feed user interaction history directly to an LLM
Introduces two prompting strategies: one simulating warm-start (source + target history) and one simulating cold-start (source history only)
Demonstrates that LLMs can infer target domain preferences solely from source domain behavior, sometimes performing better without target domain noise

Evaluation Highlights

GPT-4 achieves MRR of 0.308 on Books->Movies ranking, outperforming the best baseline (PTUPCDR) which scored 0.2596
GPT-4o achieves NDCG of 0.412 on Movies->Music ranking, surpassing the best baseline (PTUPCDR) score of 0.3822
For rating prediction, llama-3-8b-instruct performs better when using *only* source domain data compared to combining source and target data

Breakthrough Assessment

7/10

Strong empirical results showing LLMs beat specialized baselines without training. The counter-intuitive finding that source-only data sometimes yields better rating predictions than source+target is a notable insight for the field.

⚙️ Technical Details

Problem Definition

Setting: Cross-Domain Recommendation (CDR) involving a source domain and a target domain with overlapping users.

Inputs: User interaction history (items and ratings) from source domain (and optionally target domain).

Outputs: Predicted rating for a candidate item (Rating Task) or a ranked list of candidate items (Ranking Task) in the target domain.

Pipeline Flow

Prompt Construction (User History Selection)
LLM Inference
Output Mapping (Text to Rating/Rank)

System Modules

Prompt Constructor

Selects recent user interactions (up to 10) from source/target domains and formats them into a textual template defining the recommender role

Model or implementation: Rule-based script

LLM Reasoner

Processes the interaction history and generates a prediction based on learned world knowledge and reasoning

Model or implementation: GPT-4 / GPT-4o / Llama-3-8B-Instruct

Output Mapper

Converts LLM textual output into numerical ratings or ordered lists for evaluation

Model or implementation: Rule-based mapping

Novel Architectural Elements

Prompt-based bridging of domains: Using natural language context to transfer preferences instead of latent vector mapping or neural networks

Modeling

Base Model: GPT-4, GPT-4o, GPT-3.5-turbo, Llama-2-7b/13b, Llama-3-8b-Instruct

Compute: Inference-only. No training compute reported.

Comparison to Prior Work

vs. PTUPCDR/EMCDR: LLMs require no training or domain-specific embeddings; they rely on general pre-trained knowledge
vs. UniCDR: LLMs can handle cold-start (source-only) scenarios via reasoning rather than just shared latent structures
vs. Chat-REC [not cited in paper]: Focuses specifically on Cross-Domain transfer via prompting rather than general conversational recommendation

Limitations

LLMs struggle with completely unrelated domain pairs (e.g., Electronics -> Food)
Cost and latency of using large commercial LLMs (GPT-4) for every recommendation
Requires mapping textual outputs back to numerical ratings, which can be imprecise

Reproducibility

Code: https://github.com/ajaykv1/CDR_Meets_LLMs

📊 Experiments & Results

Evaluation Setup

Cross-domain rating prediction and ranking on Amazon Reviews subsets

Benchmarks:

Amazon Reviews (Books -> Movies) (CDR Rating/Ranking)
Amazon Reviews (Movies -> Music) (CDR Rating/Ranking)
Amazon Reviews (Electronics -> Food) (CDR Rating/Ranking)

Metrics:

RMSE (Root Mean Squared Error)
MAE (Mean Absolute Error)
MRR@10 (Mean Reciprocal Rank)
NDCG@10 (Normalized Discounted Cumulative Gain)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Ranking task results showing LLM performance vs baselines across domain pairs.
Amazon Reviews Pair 1	MRR@10	0.2596	0.308	+0.0484
Amazon Reviews Pair 1	NDCG@10	0.3646	0.383	+0.0184
Amazon Reviews Pair 2	MRR@10	0.2611	0.346	+0.0849
Amazon Reviews Pair 2	NDCG@10	0.3822	0.412	+0.0298
Rating task results demonstrating LLM proficiency in rating prediction, sometimes even without target data.
Amazon Reviews Pair 1	RMSE	1.0655	0.783	-0.2825
Amazon Reviews Pair 2	RMSE	0.9859	0.641	-0.3449

Experiment Figures

Bar charts comparing performance of 'Medium Context' vs 'High Context' prompts across domain pairs.

Main Takeaways

LLMs outperform state-of-the-art CDR baselines in ranking and rating tasks for intuitively similar domain pairs (e.g., Books->Movies).
For rating prediction, LLMs often perform better using *only* source domain data ('no target injection') than when target data is included, suggesting target data might introduce noise for the LLM's reasoning process.
High-context prompts (detailed task descriptions) consistently outperform medium-context prompts.
LLMs struggle to generalize in completely unrelated domain pairs (Electronics->Food), performing worse than baselines in ranking tasks.

📚 Prerequisite Knowledge

Prerequisites

Basics of Recommender Systems (collaborative filtering, cold-start)
Large Language Models (prompting, context windows)
Evaluation metrics (RMSE, NDCG, MRR)

Key Terms

CDR: Cross-Domain Recommendation—leveraging data from a source domain to improve recommendations in a target domain

Cold-start: The problem where a system cannot recommend items to new users due to lack of prior interaction history

Prompt Engineering: Designing natural language inputs to guide an LLM to perform a specific task

RMSE: Root Mean Squared Error—a standard metric for rating prediction accuracy

NDCG: Normalized Discounted Cumulative Gain—a ranking metric that accounts for the position of relevant items

MRR: Mean Reciprocal Rank—a metric that focuses on the rank of the first relevant item

Matrix Factorization: A technique that decomposes the user-item interaction matrix into lower-dimensional latent factors

Target Domain Behavior Injection: A prompt strategy that includes user history from the target domain (simulating warm-start)

No Target Domain Behavior Injection: A prompt strategy that includes user history ONLY from the source domain (simulating cold-start)