GenRec: Large Language Model for Generative Recommendation

📝 Paper Summary

Generative Recommendation, LLM-based Recommendation

GenRec reframes recommendation as a conditional text generation task, fine-tuning a Large Language Model with LoRA to directly generate the next item's title from user history.

Core Problem

Traditional recommendation systems rely on discriminative ranking of item IDs, which struggles with sparsity and ignores rich semantic information in item names.

Why it matters:

Collaborative filtering fails on cold-start items that lack interaction history
Ranking-based methods become computationally expensive as the candidate item pool grows
ID-based models discard the semantic knowledge LLMs have about item content (e.g., titles)

Concrete Example: In a movie recommendation scenario, a standard model sees a user watched items [ID_101, ID_405] and tries to rank [ID_888] against thousands of others. GenRec reads 'Pinocchio (1940), Legends of the Fall (1994)' and directly generates 'In the Line of Fire (1993)' by leveraging semantic understanding of the viewing habits.

Key Novelty

Generative Recommendation via Text Generation

Paradigm Shift: Instead of calculating a score for every candidate item (discriminative), the model generates the target item's name directly (generative)
Semantic Utilization: Uses the actual text of item titles as input/output rather than abstract numerical IDs, allowing the LLM to apply its pre-trained world knowledge

Architecture

The GenRec framework pipeline: Interaction Sequence → Prompt Formatting → LLaMA-LoRA Fine-tuning → Next Item Prediction

Evaluation Highlights

+3.46 percentage points HR@5 improvement over P5 baseline on MovieLens 25M (0.1034 vs 0.0688)
+2.52 percentage points NDCG@5 improvement over P5 baseline on MovieLens 25M (0.0716 vs 0.0464)
Trade-off: GenRec underperforms P5 on Amazon Toys (HR@5 0.0190 vs 0.0239), suggesting it depends on rich item-title semantics, which MovieLens movie titles provide and the sparser Amazon Toys data does not

Breakthrough Assessment

7/10

Significant step in the 'Generative Recommendation' paradigm, moving away from ranking. Strong results on text-rich datasets, though the performance drop on Amazon Toys indicates limitations in universality.

⚙️ Technical Details

Problem Definition

Setting: Sequential Recommendation as Next-Token Prediction

Inputs: Prompt containing instruction and textual history of user interactions (e.g., movie titles)

Outputs: Text string representing the name of the next item to be interacted with

Pipeline Flow

Prompt Construction (Instruction + Input History)
LLM Processing (LLaMA + LoRA)
Text Generation (Item Name Prediction)

System Modules

Prompt Constructor

Formats user interaction history into natural language prompts

Model or implementation: Template-based
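
A minimal sketch of a template-based prompt constructor, to make the "Instruction + Input History" format concrete. The instruction wording, the section markers, and the names `INSTRUCTION`, `build_prompt`, and `max_items` are assumptions for illustration; the paper's exact template may differ.

```python
from typing import List

# Hypothetical instruction text; not taken from the paper.
INSTRUCTION = (
    "Given the user's viewing history, "
    "predict the title of the next movie they will watch."
)


def build_prompt(history: List[str], max_items: int = 10) -> str:
    """Format the most recent item titles into an instruction-style prompt.

    max_items is assumed to be a positive integer.
    """
    recent = history[-max_items:]  # keep only the latest interactions
    joined = ", ".join(recent)
    # The model is expected to continue the text after "### Response:".
    return (
        f"### Instruction:\n{INSTRUCTION}\n\n"
        f"### Input:\n{joined}\n\n"
        f"### Response:\n"
    )


prompt = build_prompt(["Pinocchio (1940)", "Legends of the Fall (1994)"])
print(prompt)
```

Truncating to the most recent titles is one simple way to stay within the 256-token input limit listed under Key Hyperparameters; a token-based cutoff would be more precise.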

LLM Backbone

Processes context and generates the next sequence

Model or implementation: LLaMA (with LoRA adapters)
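
As a rough illustration of what a LoRA adapter adds to a frozen weight matrix, the sketch below computes the effective weight W0 + (alpha / r) * B A in plain Python. The dimensions, `alpha`, and the rank are illustrative assumptions; the adapter settings GenRec actually uses are not stated in this summary.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x n) and b (n x p)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lora_effective_weight(w0: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W0 + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)              # A has shape (r x d_in)
    scale = alpha / r
    delta = matmul(b, a)    # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters added by one adapter pair (A and B)."""
    return r * (d_in + d_out)


random.seed(0)
d_out, d_in, r = 4, 6, 2    # toy sizes, chosen only for the demo
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable
b = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero init

w_eff = lora_effective_weight(w0, a, b, alpha=16.0)
print(w_eff == w0)  # B starts at zero, so the adapter initially changes nothing
print(lora_param_count(4096, 4096, 8), "trainable vs", 4096 * 4096, "frozen")
```

Only A and B receive gradients, which is why the memory footprint of fine-tuning stays small relative to updating the full matrix.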

Novel Architectural Elements

Purely generative output mechanism for recommendation: The architecture does not include a final ranking/softmax layer over a fixed item set, but rather a language head generating text tokens

Modeling

Base Model: LLaMA (7B implied by context)

Training Method: Supervised Fine-Tuning (SFT) with LoRA

Objective Functions:

Purpose: Minimize the difference between generated text and actual next item name.

Formally: Causal Language Modeling loss (Cross-Entropy on next token)

Adaptation: LoRA (Low-Rank Adaptation)
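
The causal language modeling objective can be sketched as a mean next-token cross-entropy. The version below also masks prompt tokens so that only the target item name contributes to the loss; that masking, the `IGNORE` label value, and the function names are assumptions for illustration, not details confirmed by the paper.

```python
import math
from typing import List, Sequence

IGNORE = -100  # label value for positions excluded from the loss (assumed convention)


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def causal_lm_loss(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean cross-entropy where logits[t] predicts labels[t + 1]."""
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE:    # skip prompt positions
            continue
        total -= log_softmax(logits[t])[target]
        count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 4 tokens, first two positions are prompt tokens.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
toy_labels = [IGNORE, IGNORE, 2, 3]
print(causal_lm_loss(toy_logits, toy_labels))
```

With uniform logits over four tokens, each scored position contributes log(4), which gives a quick sanity check for the implementation.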

Training Data:

Sequence split (leave-one-out): each user's most recent item is held out for testing, the second-most recent item for validation, and the remaining items are used for training
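
The split described above can be sketched as follows. The function name and the choice to include the validation item in the test-time history are assumptions; the summary does not say how the authors handle that detail.

```python
from typing import List, Tuple

Example = Tuple[List[str], str]  # (history, target item)


def leave_one_out_split(seq: List[str]) -> Tuple[List[str], Example, Example]:
    """Split one user's chronologically ordered items into train/valid/test."""
    if len(seq) < 3:
        raise ValueError("need at least 3 interactions to form all three splits")
    train = seq[:-2]                # everything except the last two items
    valid = (seq[:-2], seq[-2])     # predict the second-most recent item
    test = (seq[:-1], seq[-1])      # predict the most recent item
    return train, valid, test


train, valid, test = leave_one_out_split(["a", "b", "c", "d"])
print(train, valid, test)
```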

Key Hyperparameters:

learning_rate: 3e-4
batch_size: 128
epochs: 5
max_input_length: 256 tokens
warm_up_steps: 1000

Compute: With LoRA, fine-tuning fits on a single GPU with 24GB of memory. The reported experiments used 4x NVIDIA RTX A5000 GPUs.

Comparison to Prior Work

vs. P5: GenRec uses LLaMA (decoder-only), whereas P5 uses T5 (encoder-decoder). GenRec focuses on generating item names, while P5 handles multiple tasks, including ranking.
vs. Traditional CF: GenRec is generative text-based vs. discriminative ID-based.

Limitations

Performance degradation on datasets with less semantic richness (Amazon Toys vs MovieLens)
Generative latency is typically higher than dot-product retrieval (implied)
Requires exact string matching for evaluation, which can be brittle (implied)
Significant GPU memory requirements compared to traditional matrix factorization
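
To illustrate why exact string matching is brittle, the sketch below maps a generated title back to a catalog entry after normalization, with a fuzzy fallback. This is not described in the paper; `normalize`, `match_to_catalog`, and the `cutoff` value are hypothetical choices for illustration.

```python
import difflib
import re
from typing import List, Optional


def normalize(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", title.casefold())
    return " ".join(no_punct.split())


def match_to_catalog(generated: str, catalog: List[str],
                     cutoff: float = 0.8) -> Optional[str]:
    """Return the catalog title that best matches the generated text, if any."""
    index = {normalize(t): t for t in catalog}
    key = normalize(generated)
    if key in index:                # exact match after normalization
        return index[key]
    # Fuzzy fallback; cutoff is an assumed threshold, not a tuned value.
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


catalog = ["In the Line of Fire (1993)", "Pinocchio (1940)"]
print(match_to_catalog("in the line of fire (1993) ", catalog))
print(match_to_catalog("Completely Unrelated", catalog))
```

A strict evaluation would count the first query as a miss because of casing and trailing whitespace, even though the intended item is unambiguous.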

Reproducibility

Code: https://github.com/rutgerswiselab/GenRec

Code and data are open-sourced at the repository above. The LLaMA backbone weights require an access request. Key hyperparameters (learning rate, batch size) are reported.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation predicting the next item in a user's history

Benchmarks:

MovieLens 25M (Movie Recommendation)
Amazon Toys (Product Recommendation)

Metrics:

Hit Ratio (HR@5, HR@10)
NDCG (NDCG@5, NDCG@10)
Statistical methodology: Not explicitly reported in the paper
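
For reference, HR@k and NDCG@k with a single ground-truth item per user can be computed as in the sketch below. How GenRec produces a ranked top-k list from a text generator is not covered in this summary, so the ranked lists here are simply taken as given.

```python
import math
from typing import List, Sequence, Tuple


def hit_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """With one relevant item, IDCG is 1, so NDCG is 1 / log2(rank + 1)."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def evaluate(ranked_lists: List[Sequence[str]], targets: List[str],
             k: int) -> Tuple[float, float]:
    """Average HR@k and NDCG@k over all test users."""
    n = len(targets)
    hr = sum(hit_at_k(r, t, k) for r, t in zip(ranked_lists, targets)) / n
    ndcg = sum(ndcg_at_k(r, t, k) for r, t in zip(ranked_lists, targets)) / n
    return hr, ndcg


hr, ndcg = evaluate([["A", "B", "C"], ["X", "Y"]], ["B", "Z"], k=5)
print(hr, ndcg)
```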

Key Results

Benchmark	Metric	P5 (Baseline)	GenRec (This Paper)	Δ (GenRec - P5)
MovieLens 25M	HR@5	0.0688	0.1034	+0.0346
MovieLens 25M	NDCG@5	0.0464	0.0716	+0.0252
MovieLens 25M	HR@10	0.1040	0.1311	+0.0271
Amazon Toys	HR@5	0.0239	0.0190	-0.0049
Amazon Toys	NDCG@5	0.0145	0.0136	-0.0009

MovieLens 25M: GenRec outperforms the P5 baseline on every reported metric, likely due to the rich semantic information in movie titles.
Amazon Toys: P5 outperforms GenRec, suggesting GenRec struggles when interaction information and text semantics are less robust.
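
The absolute differences in the Δ column can be re-derived from the reported scores, and relative changes follow directly. The snippet below is a small arithmetic check on the table values only; the `rows` list and `deltas` helper are illustrative names, and the relative figures are derived here rather than reported by the paper.

```python
from typing import List, Tuple

# (benchmark, metric, P5 baseline, GenRec), copied from the table.
rows: List[Tuple[str, str, float, float]] = [
    ("MovieLens 25M", "HR@5", 0.0688, 0.1034),
    ("MovieLens 25M", "NDCG@5", 0.0464, 0.0716),
    ("MovieLens 25M", "HR@10", 0.1040, 0.1311),
    ("Amazon Toys", "HR@5", 0.0239, 0.0190),
    ("Amazon Toys", "NDCG@5", 0.0145, 0.0136),
]


def deltas(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute difference, relative change versus the baseline)."""
    absolute = ours - baseline
    return absolute, absolute / baseline


for bench, metric, baseline, ours in rows:
    absolute, relative = deltas(baseline, ours)
    print(f"{bench:14s} {metric:7s} abs={absolute:+.4f} rel={relative:+.1%}")
```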

Main Takeaways

GenRec demonstrates superior performance on datasets with rich textual semantics (MovieLens), effectively leveraging LLM knowledge.
The model underperforms baselines like P5 on sparser datasets (Amazon Toys), highlighting a dependency on high-quality text data for the generative approach.
LoRA fine-tuning enables LLaMA-based recommendation on consumer-grade hardware (24GB VRAM).

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering (CF) vs Content-Based Filtering
Language Modeling (Next-token prediction)
Parameter-Efficient Fine-Tuning (PEFT)

Key Terms

Generative Recommendation: A paradigm where the system directly generates the identifier (or name) of the target item, rather than scoring a list of candidates

Discriminative Recommendation: Traditional approach that calculates a ranking score for each candidate item and sorts them to select recommendations

LoRA: Low-Rank Adaptation—a technique to fine-tune large models by training only a small number of extra parameters, reducing memory usage

HR: Hit Ratio—metric measuring the percentage of times the ground-truth item appears in the top-k recommendations

NDCG: Normalized Discounted Cumulative Gain—metric measuring ranking quality, giving higher scores to correct items appearing higher in the list

Cold Start: The difficulty of recommending new items, or of recommending to new users, when no prior interaction history exists

P5: Pre-train, Personalized Prompt, and Predict Paradigm—a baseline LLM-based recommendation framework using T5