LLM-Rec: Personalized Recommendation via Prompting Large Language Models

📝 Paper Summary

Text-based Recommendation Data Augmentation with LLMs

LLM-Rec improves text-based recommendation by prompting Large Language Models to generate enriched item descriptions that incorporate user engagement data, allowing simple MLP models to outperform complex content-based architectures.

Core Problem

Text-based recommendation systems struggle with incomplete or generic item descriptions that fail to explicitly capture attributes relevant to specific user preferences.

Why it matters:

Original item descriptions often lack crucial details (e.g., tone, specific dietary restrictions) needed for accurate personalization
Generic descriptions are not tailored to specific user groups, leading to misalignment between item characteristics and user needs
Performance of recommendation models is heavily bottlenecked by the quality and richness of the input text

Concrete Example: A user follows a vegan diet. A recipe description lists ingredients but lacks the explicit tag 'vegan'. A standard recommender misses this match due to insufficient text. LLM-Rec infers 'vegan' from the ingredients via prompting, enabling the system to recommend the recipe.

Key Novelty

Prompt-based Description Enrichment

Uses LLMs as a data augmentation tool to paraphrase item descriptions, tag them, and infer the emotions they evoke before the text enters the recommendation model
Introduces 'Engagement-guided Prompting', which includes descriptions of neighbor items (items the user also engaged with) in the prompt, guiding the LLM to identify attractive commonalities
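
The neighbor lookup behind engagement-guided prompting can be sketched as follows. This is a minimal illustration, not the paper's procedure: ranking neighbors by raw co-engagement counts is an assumed stand-in for whatever importance measure the authors use, and `top_neighbors`, the item ids, and the toy interactions are all hypothetical.

```python
from collections import Counter, defaultdict


def top_neighbors(interactions, target_item, k=3):
    """Return up to k items most often co-engaged with target_item.

    interactions: iterable of (user_id, item_id) implicit-feedback pairs.
    Co-engagement counting is an assumption made for illustration only.
    """
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)

    counts = Counter()
    for items in items_by_user.values():
        if target_item in items:
            # Every other item this user engaged with gets one vote.
            counts.update(items - {target_item})

    # Sort by count (descending), then by item id so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:k]]


# Hypothetical toy data.
interactions = [
    ("u1", "lentil_curry"), ("u1", "tofu_stir_fry"), ("u1", "chickpea_salad"),
    ("u2", "lentil_curry"), ("u2", "tofu_stir_fry"),
    ("u3", "beef_stew"), ("u3", "chickpea_salad"),
]
neighbors = top_neighbors(interactions, "lentil_curry", k=2)
```

The descriptions of the returned neighbors would then be placed in the prompt next to the target item's description.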

Evaluation Highlights

+21.72% relative improvement in NDCG@10 on the Recipe dataset (0.0580 to 0.0706) using GPT-3-augmented text compared to the standard MLP baseline
Simple MLP models using LLM-Rec augmented text outperform complex state-of-the-art content-based models like EDCN and DCN-V2
Llama-2-7B-Chat achieves performance comparable to GPT-3 (text-davinci-003), demonstrating the effectiveness of open-source models for this task

Breakthrough Assessment

7/10

Offers a significant practical improvement by demonstrating that better input data (via LLMs) allows simpler models to beat complex architectures. The engagement-guided prompting is a clever, domain-agnostic innovation.

⚙️ Technical Details

Problem Definition

Setting: Top-K Recommendation based on implicit feedback and textual item content

Inputs: User-item interaction history and original item text descriptions

Outputs: Ranked list of items for a target user

Pipeline Flow

Prompt Construction (Selects strategy: Basic, Rec-driven, or Engagement-guided)
Text Augmentation (LLM Inference to generate enriched text)
Recommendation (Feed enriched text into MLP for scoring)

System Modules

Prompt Constructor

Constructs the input prompt for the LLM based on the selected strategy (e.g., retrieving neighbor items for engagement-guided prompts)

Model or implementation: Rule-based templates
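
A rule-based constructor of this kind might look like the sketch below. The template wording is illustrative and should not be read as the paper's exact prompts. The function name `build_prompt`, the strategy keys, and the `BASIC_TASKS` mapping are assumptions introduced here.

```python
# Illustrative instruction suffixes for the three basic tasks (assumed wording).
BASIC_TASKS = {
    "paraphrase": "Paraphrase it.",
    "tag": "Summarize it with tags.",
    "infer": "What kind of emotions can it evoke?",
}


def build_prompt(description, strategy="basic", task="paraphrase",
                 neighbor_descriptions=None):
    """Build one augmentation prompt from a fixed template.

    strategy: "basic", "rec_driven", or "engagement_guided" (hypothetical keys).
    """
    intro = f'The description of an item is as follows: "{description}". '
    if strategy == "basic":
        return intro + BASIC_TASKS[task]
    if strategy == "rec_driven":
        return intro + "What else should I say if I want to recommend it to others?"
    if strategy == "engagement_guided":
        if not neighbor_descriptions:
            raise ValueError("engagement_guided needs at least one neighbor description")
        # The target description comes first, followed by its neighbors.
        joined = "; ".join(f'"{d}"' for d in [description, *neighbor_descriptions])
        return f"Summarize the commonalities among the following descriptions: {joined}."
    raise ValueError(f"unknown strategy: {strategy}")


prompt = build_prompt(
    "Red lentils simmered with coconut milk and spinach",
    strategy="engagement_guided",
    neighbor_descriptions=["Tofu stir-fried with broccoli"],
)
```

Keeping the templates as plain strings means the constructor needs no model of its own. Only the Text Augmentor step calls an LLM.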

Text Augmentor

Generates enriched item descriptions, tags, or emotional inferences

Model or implementation: GPT-3 (text-davinci-003) or Llama-2-7B-Chat

Recommender

Predicts user preference scores based on the augmented text

Model or implementation: MLP (Multi-Layer Perceptron)
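
As a rough sketch of how an MLP can turn text features into preference scores, the pure-Python forward pass below concatenates a user vector with an item text vector and passes them through one hidden layer. Everything here is an assumption for illustration: the summary does not specify the layer sizes, how the user and item vectors are combined, or which text encoder produces the item vector. The weights are random and untrained.

```python
import random


def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def init_params(in_dim, hidden_dim, seed=0):
    """Random, untrained weights for a one-hidden-layer MLP (illustrative)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"W1": mat(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "W2": mat(1, hidden_dim), "b2": [0.0]}


def mlp_score(user_vec, item_vec, params):
    """Score one user-item pair; item_vec stands for the embedded augmented text."""
    x = list(user_vec) + list(item_vec)  # concatenation is an assumed design
    hidden = [max(0.0, h) for h in linear(params["W1"], params["b1"], x)]  # ReLU
    (score,) = linear(params["W2"], params["b2"], hidden)
    return score


def rank_items(user_vec, item_vecs, params, k=10):
    """Return the ids of the k highest-scoring items."""
    scored = [(mlp_score(user_vec, vec, params), item_id)
              for item_id, vec in item_vecs.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical 2-d user vector and 4-d item text vectors.
params = init_params(in_dim=6, hidden_dim=4)
user_vec = [0.1, -0.2]
item_vecs = {"r1": [0.3, 0.1, 0.0, 0.5], "r2": [-0.4, 0.2, 0.9, 0.1], "r3": [0.0, 0.0, 0.1, -0.3]}
top_2 = rank_items(user_vec, item_vecs, params, k=2)
```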

Novel Architectural Elements

Integrates interaction-based neighbors into the *text generation prompt* (engagement-guided prompting) rather than only into the recommendation model itself

Modeling

Base Model: GPT-3 (text-davinci-003) and Llama-2-7B-Chat (for augmentation)

Training Method: The LLM is used in inference-only mode (prompting). The downstream MLP recommender is trained via standard supervised learning.

Key Hyperparameters:

recommendation_list_size_K: 10
negative_sample_ratio: 1 (training), 1000 (testing)
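
The negative sampling ratios above can be illustrated with a short sketch. It assumes negatives are drawn uniformly at random from items the user has not interacted with, and it reads the ratios as 1 sampled negative per training positive and 1000 sampled negatives for a test candidate set. Both readings, along with the helper names, are assumptions rather than details confirmed by the summary.

```python
import random


def sample_negatives(user_items, all_items, n, rng):
    """Sample n distinct items the user has not interacted with (pseudo-negatives)."""
    candidates = sorted(set(all_items) - set(user_items))
    if n > len(candidates):
        raise ValueError("not enough unobserved items to sample from")
    return rng.sample(candidates, n)


def training_pairs(user_items, all_items, rng, ratio=1):
    """Pair each observed positive with `ratio` sampled negatives."""
    pairs = []
    for pos in sorted(user_items):
        for neg in sample_negatives(user_items, all_items, ratio, rng):
            pairs.append((pos, neg))
    return pairs


rng = random.Random(0)
all_items = range(2000)      # hypothetical catalogue of item ids
user_items = {3, 17, 256}    # hypothetical interaction history

train = training_pairs(user_items, all_items, rng, ratio=1)
test_negatives = sample_negatives(user_items, all_items, 1000, rng)
```

Sampling only 1000 negatives at test time keeps evaluation cheap, at the cost of scores that are not directly comparable to full-catalogue ranking.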

Compute: Not reported in the paper

Comparison to Prior Work

vs. TagGPT: LLM-Rec uses engagement-guided prompting (neighbors) rather than just zero-shot tagging
vs. KAR: LLM-Rec focuses on text enrichment passed to simple models rather than complex reasoning architectures
vs. EDCN/DCN-V2: LLM-Rec achieves better results with a simpler MLP backbone by improving the input quality

Limitations

Dependency on the quality and cost of LLM API calls (for GPT-3)
Latency concerns during the augmentation phase (though this can be pre-computed)
Potential hallucinations in generated text, though the paper argues the engagement-guided context mitigates this

Reproducibility

Prompt templates are provided in the paper. The datasets (MovieLens-1M, Recipe) are standard. No code URL is provided in the text. The evaluation protocol follows Wei et al. (2019).

📊 Experiments & Results

Evaluation Setup

Personalized top-K recommendation

Benchmarks:

MovieLens-1M (Movie Recommendation)
Recipe (Recipe Recommendation)

Metrics:

NDCG@10
Precision@10
Recall@10

Statistical methodology: Mean and standard deviation reported across five different splits
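
For reference, the three metrics can be computed as in the sketch below, using the common binary-relevance definitions. The function names are hypothetical, and using the sample standard deviation to aggregate across splits is an assumption, since the summary does not say which variant the paper reports.

```python
import math
from statistics import mean, stdev


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k slots filled with relevant items."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def summarize(per_split_scores):
    """Mean and sample standard deviation across evaluation splits."""
    return mean(per_split_scores), stdev(per_split_scores)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p = precision_at_k(ranked, relevant, k=4)  # 0.5
r = recall_at_k(ranked, relevant, k=4)     # 1.0
n = ndcg_at_k(ranked, relevant, k=4)
```

In the example, the second relevant item sits at position 3 instead of position 2, so NDCG falls below 1 even though recall is perfect.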

Key Results

Performance on MovieLens-1M, showing LLM-Rec improvements over the MLP baseline and complex content-based methods:

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
MovieLens-1M	NDCG@10	0.3640	0.3867	+0.0227
MovieLens-1M	NDCG@10	0.3640	0.3951	+0.0311
MovieLens-1M	NDCG@10	0.3824	0.3951	+0.0127

Performance on the Recipe dataset, showing larger gains due to sparse original metadata:

Benchmark	Metric	Baseline	This Paper	Δ (absolute)
Recipe	NDCG@10	0.0580	0.0706	+0.0126
Recipe	NDCG@10	0.0652	0.0706	+0.0054
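
The Δ values in the table are absolute differences, whereas the headline +21.72% figure is a relative gain. The short calculation below converts each row to a relative percentage. The numbers are copied from the table, and only the helper name `relative_gain_pct` is introduced here.

```python
# (benchmark, baseline NDCG@10, LLM-Rec NDCG@10), copied from the table rows.
rows = [
    ("MovieLens-1M", 0.3640, 0.3867),
    ("MovieLens-1M", 0.3640, 0.3951),
    ("Recipe", 0.0580, 0.0706),
    ("Recipe", 0.0652, 0.0706),
    ("MovieLens-1M", 0.3824, 0.3951),
]


def relative_gain_pct(baseline, ours):
    """Relative improvement over the baseline, in percent."""
    return (ours - baseline) / baseline * 100


gains = [(name, round(relative_gain_pct(base, ours), 2)) for name, base, ours in rows]
# The Recipe row with baseline 0.0580 gives 21.72, matching the headline figure.
```

The same absolute gain counts for more on Recipe because its baseline NDCG@10 is much lower than on MovieLens-1M.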

Main Takeaways

Augmented text significantly enhances recommendation quality, especially for datasets with sparse or incomplete descriptions (Recipe).
LLM-Rec enables simple MLP models to outperform complex feature-interaction models (AutoInt, DCN-V2, EDCN), suggesting that input quality can matter more than model complexity in this setting.
Engagement-guided prompting is highly effective, leveraging user behavior to guide the LLM's text generation.
Open-source models (Llama-2) are competitive with proprietary models (GPT-3) for this augmentation task.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Recommender Systems (Content-based filtering)
Familiarity with Large Language Models and Prompt Engineering
Evaluation metrics for ranking (NDCG, Recall, Precision)

Key Terms

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items in the recommendation list

MLP: Multi-Layer Perceptron—a simple feedforward neural network used here as the core recommendation model

Zero-shot prompting: Providing a task to an LLM without giving it any examples of the desired output, relying on its pre-trained knowledge

Engagement-guided prompting: A strategy where the LLM is prompted with the target item and its 'neighbor' items (items linked to it through user engagement) to find common attractive traits

AutoInt: A recommender baseline model that uses self-attention to learn high-order feature interactions

EDCN: Enhanced Deep Cross Network—a complex deep learning baseline for recommendation

Pseudo-negative samples: Unobserved user-item pairs assumed to be negative (irrelevant) for training purposes