Zuoli Tang, Zhaoxin Huan, Zihao Li, Xiaolu Zhang, Jun Hu, Chilin Fu, Jun Zhou, Lixin Zou, Chenliang Li
School of Cyber Science and Engineering, Wuhan University,
Ant Group
arXiv
(2023)
Recommendation, P13N
📝 Paper Summary
Sequential Recommendation, Cross-domain Recommendation, LLM for Recommendation
LLM-Rec leverages the world knowledge and semantic understanding of pre-trained Large Language Models to unify user behaviors across multiple domains into a single sequence, addressing data sparsity and cold-start problems without complex domain-specific architectures.
Core Problem
Traditional multi-domain recommendation systems struggle with data sparsity and cold-start issues because they rely on ID-based representations that lack semantic meaning and fail to align items across domains.
Why it matters:
Current cross-domain methods require complex, rigid architectures (e.g., pair-wise links) that scale poorly to many domains
ID-based methods cannot transfer semantic knowledge; a user's interest in 'running shoes' in one domain doesn't naturally map to 'sports drinks' in another without explicit overlap
Existing sequential models often fail to capture long-term dependencies or semantic correlations between diverse user interests
Concrete Example: In a preliminary study, simply concatenating item IDs from five different domains and feeding them into SASRec degraded performance compared with single-domain models, indicating that ID-based methods fail to capture cross-domain semantic connections.
Key Novelty
LLM-Rec: Domain-Agnostic LLM Framework
Treats multi-domain recommendation as a text-to-text problem by converting item titles into text and concatenating them into a single user history sentence
Uses a single pre-trained LLM backbone to encode both user history and candidate items, relying on the LLM's internal 'world knowledge' to bridge semantic gaps between domains
Demonstrates that larger model sizes (consistent with scaling laws) and parameter-efficient fine-tuning with LoRA significantly benefit recommendation performance, unlike traditional ID-based models
Architecture
The overall framework of LLM-Rec, illustrating how user behaviors from different domains are concatenated into a text sequence, processed by various LLM backbones (Encoder-only, Decoder-only, Encoder-Decoder), and used for next-item prediction.
Evaluation Highlights
Outperforms state-of-the-art baselines like SASRec and UniSRec on 5 diverse datasets, with gains particularly strong in sparse/cold-start scenarios
Larger models yield better performance: scaling from 125M to 6.7B parameters consistently improves recommendation accuracy, confirming NLP scaling laws apply here
Fine-tuning with LoRA achieves comparable or better results than full parameter tuning while requiring significantly fewer trainable parameters
Breakthrough Assessment
7/10
Strong empirical validation of LLMs for multi-domain recommendation without complex graph/task structures. Successfully applies NLP scaling laws to RecSys, though the architectural innovation is primarily the application of existing LLMs to a new setting.
⚙️ Technical Details
Problem Definition
Setting: Multi-domain Sequential Recommendation
Inputs: A user u's mixed sequence of interactions s_u = (v_1, v_2, ..., v_L') from all available domains, represented by item titles
Outputs: Probability of the next item v_{L'+1} in a target domain D^T
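A minimal sketch of how the mixed sequence s_u could be assembled, assuming each domain's log is a list of (timestamp, item_id) pairs. The helper names and the "last target-domain interaction is the label" split are hypothetical; this summary does not describe the paper's exact splitting protocol.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: (timestamp, domain, item_id)
Interaction = Tuple[int, str, str]


def build_mixed_sequence(per_domain: Dict[str, List[Tuple[int, str]]]) -> List[Interaction]:
    """Merge one user's per-domain histories into a single chronological sequence s_u."""
    merged = [(ts, domain, item) for domain, events in per_domain.items() for ts, item in events]
    merged.sort()  # timestamp first; domain and item id only break ties
    return merged


def split_for_target(
    seq: List[Interaction], target_domain: str
) -> Tuple[List[Interaction], Optional[Interaction]]:
    """Treat the user's last target-domain interaction as the label v_{L'+1}
    and everything strictly before it (from any domain) as the history."""
    for idx in range(len(seq) - 1, -1, -1):
        if seq[idx][1] == target_domain:
            return seq[:idx], seq[idx]
    return seq, None  # user never interacted with the target domain


logs = {"books": [(3, "b1"), (7, "b2")], "movies": [(1, "m1"), (5, "m2")]}
s_u = build_mixed_sequence(logs)
history, label = split_for_target(s_u, "movies")
```

Interactions after the label (here "b2") are dropped, so the model never conditions on the future.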
Pipeline Flow
Input Construction (Textualization)
LLM Backbone Encoding
Score Prediction
System Modules
Input Construction
Convert item IDs to text titles and concatenate user history into a single sequence
Model or implementation: Tokenizer (specific to LLM backbone)
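A sketch of the textualization step. The "; " separator, the whitespace word count used as a stand-in for tokens, and the drop-oldest-first truncation are assumptions for illustration; a real pipeline would count tokens with the backbone's own tokenizer.

```python
from typing import Dict, List


def textualize_history(
    item_ids: List[str],
    titles: Dict[str, str],
    max_words: int = 512,
    sep: str = "; ",
) -> str:
    """Turn a chronological list of item IDs into one text sequence.

    Assumptions for illustration: titles are joined with `sep`, the token
    budget is approximated by whitespace-separated words, and the oldest
    items are dropped first when the budget is exceeded.
    """
    kept: List[str] = []
    used = 0
    # Walk from most recent to oldest so recent behavior survives truncation.
    for item_id in reversed(item_ids):
        title = titles[item_id]
        cost = len(title.split())
        if used + cost > max_words:
            break
        kept.append(title)
        used += cost
    kept.reverse()  # restore chronological order
    return sep.join(kept)


titles = {"m1": "The Matrix", "b1": "Dune", "m2": "Blade Runner 2049"}
full = textualize_history(["m1", "b1", "m2"], titles)
short = textualize_history(["m1", "b1", "m2"], titles, max_words=4)
```

Keeping the most recent items when the budget runs out is one way to handle the sequence-length limit of the backbone.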
LLM Backbone
Generate dense vector representations for the user (based on history) and the item (based on title)
Model or implementation: Various: BERT (Encoder-only), OPT (Decoder-only), FLAN-T5 (Encoder-Decoder)
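This summary does not say how a single vector is read out of each backbone. Two common choices are sketched below as assumptions: last-token pooling for causal decoder-only models such as OPT, and masked mean pooling for bidirectional encoders such as BERT (or the encoder of FLAN-T5). Hidden states are plain nested lists so the sketch runs without a deep-learning framework.

```python
from typing import List

Vector = List[float]


def last_token_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Decoder-only (causal) models such as OPT: with right padding, the last
    non-padding position is the only one whose state has attended to every token."""
    last = max(i for i, m in enumerate(mask) if m == 1)
    return list(hidden[last])


def mean_pool(hidden: List[Vector], mask: List[int]) -> Vector:
    """Bidirectional encoders such as BERT: average the states of non-padding tokens."""
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m == 1:
            for d in range(dim):
                total[d] += vec[d]
            count += 1
    return [x / count for x in total]


# Three token positions, hidden size 2; the last position is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
```

Both functions ignore the padded position, so the [9.0, 9.0] row never affects the resulting embedding.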
Prediction Head
Calculate relevance score between user and candidate item
Model or implementation: Dot product
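Because the user history and the candidate titles are encoded by the same backbone, relevance reduces to an inner product. A minimal ranking sketch with toy vectors; the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    """Relevance score: inner product of user and item embeddings."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(
    user_vec: List[float], item_vecs: Dict[str, List[float]], k: int = 10
) -> List[Tuple[str, float]]:
    """Score every candidate by dot product and return the top-k, best first."""
    scored = [(item_id, dot(user_vec, vec)) for item_id, vec in item_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


user = [1.0, 0.0]
items = {"a": [0.2, 5.0], "b": [0.9, 0.0], "c": [0.5, 1.0]}
top2 = rank_candidates(user, items, k=2)
```

Item "a" has the largest norm but ranks last, because the user vector puts no weight on its dominant dimension.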
Novel Architectural Elements
Unified text-based input format that mixes heterogeneous domain items into a single natural language sequence
Application of decoder-only (OPT) and encoder-decoder (FLAN-T5) architectures for embedding-based recommendation (rather than generative recommendation)
Modeling
Base Model: Evaluated multiple: BERT (Medium/Base/Large), OPT (125M to 6.7B), FLAN-T5 (Small to XL)
Training Method: Supervised Fine-Tuning (SFT) with Cross-Entropy Loss
Objective Functions:
Purpose: Maximize the probability of the ground-truth next item while minimizing probabilities of negative samples.
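Formally (binary cross-entropy with sampled negatives): $\mathcal{L} = -\sum_j \left[ \log \sigma(u_j \cdot v_j^{+}) + \log\left(1 - \sigma(u_j \cdot v_j^{-})\right) \right]$, where u_j is the user-history embedding, v_j^{+} the ground-truth next-item embedding, v_j^{-} a sampled negative item embedding, and σ the sigmoid function.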
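The objective pairs each positive with a negative under a sigmoid: for every training example it adds -log σ(u_j · v_j+) for the positive item and -log(1 - σ(u_j · v_j-)) for the negative. A minimal, numerically stable sketch in plain Python, assuming dot-product scores as in the prediction head; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) = -softplus(-x)."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def bce_loss(users: List[List[float]], positives: List[List[float]],
             negatives: List[List[float]]) -> float:
    """-sum_j [log sigmoid(u_j . v_j+) + log(1 - sigmoid(u_j . v_j-))].

    Uses the identity log(1 - sigmoid(x)) = log sigmoid(-x) for stability.
    """
    total = 0.0
    for u, pos, neg in zip(users, positives, negatives):
        total += log_sigmoid(dot(u, pos)) + log_sigmoid(-dot(u, neg))
    return -total
```

With all-zero embeddings every score is 0, so each example contributes 2 log 2; well-separated scores drive the loss toward 0, and very negative scores do not overflow `math.exp`.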
Adaptation: LoRA (Low-Rank Adaptation) and Full Fine-tuning compared
Trainable Parameters: Varies by setting; LoRA trains <1% of parameters
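A minimal sketch of the LoRA update for one frozen weight matrix, h = W x + (alpha / r) B A x. The rank, alpha, and which matrices receive adapters are not reported in this summary, so the values below are arbitrary. For an illustrative 4096 x 4096 projection with rank 8, the adapter holds 1/256 (about 0.39%) of that matrix's parameters, in line with the "<1%" figure.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float],
                 alpha: float, r: int) -> List[float]:
    """h = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r) are trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_fraction(d_in: int, d_out: int, r: int) -> float:
    """Trainable LoRA parameters relative to the frozen matrix they adapt."""
    return r * (d_in + d_out) / (d_in * d_out)


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]           # r = 1, d_in = 2
B_init = [[0.0], [0.0]]    # B starts at zero, so the adapter is initially a no-op
B_trained = [[1.0], [0.0]]
x = [2.0, 3.0]
```

Initializing B to zero means fine-tuning starts exactly from the pre-trained model's behavior.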
Key Hyperparameters:
negative_samples: 1
batch_size: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
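With negative_samples set to 1, each positive is paired with one sampled negative. The sampling distribution is not stated here; the hypothetical helper below assumes uniform sampling from the target domain's item pool, excluding the positive and the user's history.

```python
import random
from typing import List, Optional, Set


def sample_negatives(pool: List[str], exclude: Set[str], n: int = 1,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Uniformly draw n distinct items from `pool` that are not in `exclude`
    (e.g. the positive item and the user's history). Hypothetical helper."""
    rng = rng or random.Random()
    candidates = [item for item in pool if item not in exclude]
    if len(candidates) < n:
        raise ValueError("not enough items left to sample negatives from")
    return rng.sample(candidates, n)


target_pool = ["m1", "m2", "m3", "m4"]
seen = {"m1", "m2"}  # user's history in the target domain
positive = "m3"
neg = sample_negatives(target_pool, seen | {positive}, n=1, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps negative sampling reproducible across runs.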
Compute: Experiments conducted on models up to 6.7B parameters
Comparison to Prior Work
vs. SASRec: LLM-Rec uses text instead of IDs and a single model for all domains
vs. UniSRec: LLM-Rec processes multi-domain data in a single stage rather than pre-train/fine-tune
vs. Recformer: LLM-Rec investigates decoder-only and larger scale models (up to 6.7B) rather than just encoder-only architectures
Limitations
Computational cost of inference with 6.7B-parameter models is significantly higher than with ID-based embeddings
Requires meaningful item titles; may fail if text descriptions are of poor quality
Maximum sequence length of LLMs may limit the length of user history considered
No specific details provided on latency or real-time serving feasibility
Reproducibility
No code URL is provided in the paper. Datasets are real-world, and their sources and preprocessing follow standard practice. Hyperparameters such as batch size and learning rate are missing from the text.
📊 Experiments & Results
Evaluation Setup
Next-item prediction on mixed multi-domain sequences
Statistical methodology: Not explicitly reported in the paper
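The candidate set and tie handling behind NDCG@10 are not reported. With exactly one relevant next item, NDCG@10 reduces to 1 / log2(rank + 1) when the ground truth lands in the top 10, and 0 otherwise. A minimal sketch; the convention that ties do not count against the target is an assumption. Averaging this over test users gives a per-domain score.

```python
import math
from typing import Dict


def ndcg_at_k(scores: Dict[str, float], target: str, k: int = 10) -> float:
    """NDCG@k for a single ground-truth next item, so the ideal DCG is 1.

    Rank is 1-indexed; candidates tied with the target are not counted
    against it (an optimistic convention assumed here).
    """
    target_score = scores[target]
    rank = 1 + sum(1 for item, s in scores.items() if item != target and s > target_score)
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


scores = {"a": 0.9, "b": 0.5, "c": 0.1}
```

A target at rank 2 scores 1 / log2(3), roughly 0.63, and any target outside the top k scores 0.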
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
5-Domain Average | NDCG@10 | 0.3347 | 0.4578 | +0.1231
5-Domain Average | NDCG@10 | 0.2650 | 0.2710 | +0.0060
Ablation and model scaling experiments demonstrate the impact of model size and fine-tuning strategies.
Experiment Figures
Performance comparison (NDCG@10) of SASRec trained on single domains vs. SASRec trained on mixed multi-domain data.
Main Takeaways
Simply merging ID-based interaction data from multiple domains degrades performance, whereas processing the same data as text via LLMs improves it, suggesting that LLMs bridge semantic gaps between domains.
Scaling Law holds: Performance consistently improves as the model size increases from 125M to 6.7B parameters.
LLMs are particularly effective for cold-start and unpopular items because they rely on semantic understanding rather than collaborative signals.
Decoder-only architectures (like OPT) generally perform well, and LoRA is an effective and efficient tuning strategy compared to full fine-tuning.
Sequential Recommendation: Predicting the next likely item a user will interact with based on their chronological history of past interactions
Cold-start: The difficulty of recommending items to new users or recommending new items that have few or no prior interactions
LoRA: Low-Rank Adaptation, a technique that fine-tunes large models by training small rank-decomposition matrices while keeping the main model weights frozen
Encoder-only: Transformer architectures like BERT that use bi-directional attention, suitable for understanding context but not generating text
Decoder-only: Transformer architectures like GPT/OPT that use uni-directional attention (causal), typically used for text generation
Encoder-Decoder: Transformer architectures like T5 that process input with an encoder and generate output with a decoder
SASRec: Self-Attentive Sequential Recommendation, a standard baseline that uses a Transformer encoder to model user interaction sequences based on item IDs