A Review of Modern Recommender Systems Using Generative Models (Gen-RecSys)

📝 Paper Summary

Generative Recommender Systems
LLM-based Recommendation

Gen-RecSys replaces traditional discriminative recommendation models with generative paradigms that model and sample from complex data distributions (over interactions, text, and images), enabling novel tasks such as zero-shot and conversational recommendation.

Core Problem

Traditional recommender systems (RS) act as 'narrow experts' relying solely on user-item ratings, limiting their ability to handle complex multimodal data or perform generalized tasks without domain-specific training.

Why it matters:

Narrow expert models struggle with cold-start problems and cannot easily adapt to new tasks (e.g., explanation generation) without retraining
Discriminative models miss the rich semantic information available in text, images, and videos that modern generative models can leverage
Current surveys often focus narrowly on LLMs or specific architectures (like GANs), lacking a holistic view of the generative recommendation landscape across modalities

Concrete Example: A traditional matrix factorization model can predict a rating for a movie but cannot explain 'why' the user might like it or generate a new movie poster tailored to the user's taste. Gen-RecSys using an LLM can reason: 'You liked generic sci-fi, so try this cyberpunk movie because it features similar dystopian themes.'

Key Novelty

Unified Taxonomy for Gen-RecSys

Classifies systems by data modality: Interaction-Driven (structure-only), Text-Driven (LLMs/NLP), and Multimodal (Text+Image/Video)
Distinguishes between 'Directly Trained' models (learning distributions from scratch on interactions) and 'Pretrained' models (adapting foundation models via fine-tuning or prompting)
Integrates evaluation of 'impact and harm' alongside standard accuracy metrics, addressing the new risks that generative outputs introduce

Architecture

A hierarchical taxonomy of the Gen-RecSys survey structure

Evaluation Highlights

Not reported in the paper (Survey paper without new empirical benchmarks)
Provides a structured review of existing literature rather than comparative performance metrics

Breakthrough Assessment

8/10

Comprehensive survey that establishes a necessary taxonomy for a rapidly exploding field. While it doesn't propose a new model, it structures the chaotic landscape of LLMs and generative models in RecSys.

⚙️ Technical Details

Problem Definition

Setting: Recommender Systems utilizing Generative Models to model and sample from data distributions p(x)

Inputs: User-item interaction histories, textual data (reviews, descriptions), visual data (images, videos)

Outputs: Recommended items, generated explanations, conversational responses, or synthetic user-item interactions

Pipeline Flow

Data Source Selection (Interaction / Text / Multimodal)
Model Selection (Auto-Encoding / Auto-Regressive / GAN / Diffusion)
Adaptation Strategy (Direct Training / Fine-Tuning / Prompting)
Inference Task (Ranking / Generation / Explanation)

System Modules

Interaction-Driven Models (Modality-Specific Architectures)

Learn preference distributions solely from user-item interaction matrices or sequences

Model or implementation: Various (VAE-CF, BERT4Rec, GRU4Rec, DiffRec)

Text-Driven Models (LLMs) (Modality-Specific Architectures)

Leverage semantic knowledge and reasoning capabilities of LLMs for recommendation

Model or implementation: LLMs (e.g., GPT-series, LLaMA)

Multimodal Models (Modality-Specific Architectures)

Integrate visual and textual data to enhance item representation and recommendation

Model or implementation: Multimodal encoders/generators

Novel Architectural Elements

Taxonomy structure categorizing models by Data Modality (Interaction, Text, Multimodal) crossed with Model Paradigm (Direct vs. Pretrained)
Unified view of generative tasks: converting standard ranking problems into generative sampling problems
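
The shift from ranking to generative sampling can be illustrated with a minimal sketch. It assumes a model has already produced one score per item; the function names, the softmax temperature, and the toy scores are illustrative assumptions, not taken from the survey.

```python
import numpy as np

def top_k_ranking(scores: np.ndarray, k: int) -> np.ndarray:
    """Ranking view: return the indices of the k highest-scoring items."""
    return np.argsort(-scores)[:k]

def sample_recommendations(scores: np.ndarray, k: int,
                           temperature: float = 1.0, seed: int = 0) -> np.ndarray:
    """Sampling view: draw k distinct items from softmax(scores / temperature)."""
    rng = np.random.default_rng(seed)
    logits = scores / temperature
    logits = logits - logits.max()          # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)

# Toy scores for a catalog of five items (hypothetical values).
scores = np.array([2.0, 0.5, 1.5, -1.0, 0.0])
ranked = top_k_ranking(scores, k=3)
sampled = sample_recommendations(scores, k=3, temperature=1.0, seed=42)
```

The ranking view always returns the same list for the same scores, while the sampling view treats the scores as a distribution, so different seeds or a higher temperature can surface lower-scored items.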

Modeling

Base Model: The survey covers multiple architectures, including VAEs, GANs, Diffusion Models, and Transformers (LLMs)

Training Method: Varies by sub-field (Direct Training for VAEs, Fine-Tuning/Prompting for LLMs)

Objective Functions:

Purpose: Reconstruct input interactions (Autoencoders).
Formally: Minimize a reconstruction loss (often MSE or cross-entropy).

Purpose: Maximize the evidence lower bound (VAEs).
Formally: ELBO = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)).

Purpose: Predict the next token/item (Auto-regressive).
Formally: Maximize the likelihood p(x_i | x_<i) of each item given its predecessors.

Purpose: Denoise corrupted inputs (Diffusion).
Formally: Predict the noise added to the input at timestep t.
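
The four objectives above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the VAE posterior is taken to be a diagonal Gaussian with a standard normal prior (a common choice that the summary does not specify), and all function names are hypothetical.

```python
import numpy as np

def reconstruction_loss(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Autoencoder objective: mean squared error between input and reconstruction."""
    return float(np.mean((x - x_hat) ** 2))

def gaussian_kl(mu: np.ndarray, log_var: np.ndarray) -> float:
    """KL(q(z|x) || p(z)) for a diagonal Gaussian q and a standard normal prior p.
    The Gaussian form is an assumption made for this sketch."""
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

def elbo(log_likelihood: float, mu: np.ndarray, log_var: np.ndarray) -> float:
    """ELBO = E[log p(x|z)] - KL(q(z|x) || p(z)).
    `log_likelihood` is an estimate of the expectation term (e.g. from one sample of z)."""
    return log_likelihood - gaussian_kl(mu, log_var)

def autoregressive_nll(logits: np.ndarray, targets: np.ndarray) -> float:
    """Negative log-likelihood -sum_i log p(x_i | x_<i).
    logits[i] holds unnormalized scores over the catalog at step i."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].sum())

def noise_prediction_loss(true_noise: np.ndarray, predicted_noise: np.ndarray) -> float:
    """Diffusion objective: MSE between the noise added at step t and the predicted noise."""
    return float(np.mean((true_noise - predicted_noise) ** 2))
```

Maximizing the ELBO or the auto-regressive likelihood is equivalent to minimizing their negatives, which is how these objectives are usually handed to a gradient-based optimizer.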

Adaptation: Fine-tuning (adjusting weights on RecSys data), Prompting (In-Context Learning), RAG (Retrieval-Augmented Generation)
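
As a rough sketch of how prompting and retrieval augmentation can be combined without touching model weights, the example below retrieves candidates by cosine similarity and places them in a few-shot prompt. The helper names, the prompt wording, and the two-dimensional toy embeddings are all assumptions for illustration; no LLM is called, and the resulting string would be passed to whichever model is in use.

```python
import numpy as np

def retrieve(query_vec, item_vecs, item_titles, k=2):
    """Retrieval step: return the k titles whose embeddings are most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    top = np.argsort(-(m @ q))[:k]
    return [item_titles[i] for i in top]

def build_prompt(history, candidates, examples=()):
    """Prompting step: few-shot examples (in-context learning), then the
    user's history and the retrieved candidates."""
    lines = []
    for past, liked in examples:
        lines.append(f"User watched: {', '.join(past)}\nRecommendation: {liked}\n")
    lines.append(f"User watched: {', '.join(history)}")
    lines.append(f"Candidates: {', '.join(candidates)}")
    lines.append("Recommendation:")
    return "\n".join(lines)

# Toy catalog with made-up 2-d embeddings (hypothetical values).
titles = ["Blade Runner", "Notting Hill", "The Matrix"]
vecs = np.array([[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]])
user_vec = np.array([1.0, 0.0])

candidates = retrieve(user_vec, vecs, titles, k=2)
prompt = build_prompt(["Alien", "Dune"], candidates,
                      examples=[(["Toy Story"], "Up")])
```

Fine-tuning, by contrast, would update the model's weights on recommendation data and is not shown here.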

Trainable Parameters: Varies, from the full model (Direct Training) to adapters only or none (Pretrained LLMs)

Comparison to Prior Work

vs. Traditional RS: Gen-RecSys can generate new content/explanations, not just rank existing items
vs. Discriminative RS: Gen-RecSys models the full data distribution p(x), which allows sampling and handling of complex modalities (text/image), whereas discriminative models estimate p(y|x) purely for classification or regression
vs. Previous Surveys: This survey covers the full spectrum (Interaction, Text, Multimodal) rather than just LLMs or GANs (a differentiation from earlier surveys, not a model-level comparison)

Limitations

Survey scope is broad, potentially sacrificing depth on specific sub-architectures
Field is moving extremely fast; specific LLM benchmarks mentioned may become outdated quickly
Focus is on categorization and review rather than proposing a novel solution to a specific problem

Reproducibility

Code: https://encr.pw/vDhLq

The paper is a survey and does not propose a single model to reproduce. However, it provides a tutorial link (https://encr.pw/vDhLq) containing supporting materials and potentially code examples for the discussed methods.

📊 Experiments & Results

Evaluation Setup

Review of evaluation methodologies for Gen-RecSys

Benchmarks:

Standard RecSys Benchmarks (Rating Prediction / Top-K Recommendation)

Metrics:

Accuracy (Recall, NDCG)
Diversity
Fairness
Explanation Quality (BLEU, ROUGE)
Safety/Harms

Statistical methodology: Not explicitly reported in the paper
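
For reference, the two accuracy metrics named in the list above (Recall and NDCG) can be computed as in the sketch below. It assumes binary relevance and normalizes Recall@K by the total number of relevant items; some papers normalize by min(K, number of relevant items) instead. The function names and toy data are illustrative.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranked list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: items "a" and "c" are hits in the top 3; "e" is never recommended.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

Metrics such as diversity, fairness, and harm have no single agreed formula and are not sketched here.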

Main Takeaways

Generative models enable new capabilities beyond ranking, such as explanation generation and conversational interaction.
Evaluation must expand beyond accuracy to include impact, harm, and fairness, especially for powerful LLMs.
Pretrained models offer strong zero/few-shot performance but require careful adaptation (fine-tuning/RAG) for domain-specific tasks.
Multimodal Gen-RecSys is an emerging frontier, leveraging image and video generation for enhanced user experience.

📚 Prerequisite Knowledge

Prerequisites

Basic Recommender Systems concepts (Collaborative Filtering, Matrix Factorization)
Generative Deep Learning architectures (VAE, GAN, Diffusion Models)
Large Language Models (Transformers, Fine-tuning, Prompting)

Key Terms

Gen-RecSys: Recommender Systems that utilize Generative Models to learn data distributions and sample outputs

VAE-CF: Variational AutoEncoders for Collaborative Filtering—a generative model that learns the probability distribution of items a user likes

ICL: In-Context Learning—the ability of LLMs to learn tasks from a few examples in the prompt without parameter updates

RAG: Retrieval-Augmented Generation—combining a retriever to fetch relevant documents with a generator to produce answers or recommendations

GAN: Generative Adversarial Network—a framework with a generator and discriminator competing to produce realistic synthetic data

Diffusion Models: Generative models that create data by reversing a noise-adding process, used in RecSys for sequence augmentation or preference prediction

Auto-Regressive Models: Models that predict the next token (or item) in a sequence based on previous ones, widely used in sequential recommendation

Denoising Autoencoders: Models trained to recover original inputs from corrupted versions, used to learn robust user/item representations (e.g., BERT4Rec)

CVAE: Conditional Variational Autoencoder—a VAE variant that generates outputs conditioned on specific attributes (e.g., generating a recommendation list for a specific user)

Zero-shot Learning: The ability of a model to perform a task it wasn't explicitly trained for, often via prompting an LLM