Can Large Language Models Assess Serendipity in Recommender Systems?

📝 Paper Summary

Recommender Systems (RS) Evaluation LLMs as Judges / LLM-based Evaluation

LLMs can assess serendipity in recommendations better than random baselines but struggle to align highly with human judgments, with performance heavily dependent on prompt design and history length.

Core Problem

Evaluating serendipity (unexpected relevance) in recommender systems is difficult because it is subjective, emotional, and typically requires costly user surveys rather than simple engagement metrics.

Why it matters:

Traditional accuracy metrics promote over-specialization (filter bubbles), discouraging users from exploring new interests.
Obtaining ground-truth serendipity labels via human surveys is expensive, inconsistent, and unscalable.
A reliable automated proxy for human serendipity judgment is needed to optimize and evaluate serendipity-oriented algorithms at scale.

Concrete Example: A user watches 'War Dogs' and is recommended 'Gosford Park'. Humans might find this serendipitous (unexpected but liked). An LLM needs to predict this 'Yes/No' judgment based only on the user's past movie ratings, a task where standard accuracy metrics fail.

Key Novelty

LLM-based Serendipity Assessment (LSA)

Proposes using LLMs (GPT-3.5, GPT-4, Llama2) as binary classifiers to predict if a user finds a recommended item serendipitous.
Investigates four prompt variations: implicit (titles only), explicit (titles+ratings), implicit+genres, and explicit+genres to determine optimal input context.
Evaluates alignment with human ground truth from the Serendipity-2018 dataset, comparing LLMs against traditional recommender baselines.

Architecture

The proposed framework for serendipity assessment using LLMs.

Evaluation Highlights

GPT-4 achieves the highest accuracy (0.876) among LLMs, outperforming random guessing and standard baselines like 'all negative'.
LLMs generally struggle with Precision for the minority 'serendipitous' class (GPT-4 Precision: 0.207), often failing to identify true serendipity despite high accuracy.
The method outperforms the Serendipity-Oriented Greedy (SOG) baseline algorithm in classification tasks, suggesting LLMs capture nuances better than simple heuristic re-ranking scores.

Breakthrough Assessment

4/10

First exploration of LLMs for this specific metric, showing potential but low agreement with humans. It highlights the difficulty of the task rather than solving it completely.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of a tuple (user history, recommended item) into serendipitous (1) or not (0).

Inputs: User rating history I_u (sequence of items) and query item i.

Outputs: Binary value f_LLM indicating if item i is serendipitous for user u.

Pipeline Flow

History Extractor
Prompt Generator
LLM Inference
Output Parser

System Modules

History Extractor (Input Processing)

Selects user rating history

Model or implementation: Script

Prompt Generator (Input Processing)

Formats input into text prompts

Model or implementation: Template-based

LLM Inference

Binary classification of serendipity

Model or implementation: GPT-3.5, GPT-4, or Llama2-13B-Chat

Novel Architectural Elements

Prompt engineering specific to serendipity assessment incorporating rating history and item metadata (genre/rating) in four variations.

Modeling

Base Model: Evaluated multiple: GPT-3.5-turbo-0613, GPT-4-0613, Llama2-13B-Chat

Training Method: In-context learning (Few-shot prompting)

Key Hyperparameters:

temperature: 0.0
few_shot_examples: 2 (1 positive, 1 negative)
history_length: 10 items

Compute: Not reported in the paper

Comparison to Prior Work

vs. SOG: LLM assesses serendipity based on semantic knowledge/reasoning vs. SOG's mathematical distance metrics.
vs. SVD: Assesses 'surprise' quality rather than just rating prediction accuracy.

Limitations

Low agreement with human ground truth (low precision/recall for minority class).
Sensitive to prompt variations (implicit vs explicit inputs).
High cost of using commercial LLMs (GPT-4) for large-scale evaluation.
Limited context window usage (only recent 10 items used).

Reproducibility

Prompt templates are fully provided in Figure 2. Dataset is publicly available (Serendipity-2018). Code URL is not provided.

📊 Experiments & Results

Evaluation Setup

Binary classification on Serendipity-2018 dataset (277 serendipitous vs 1,873 non-serendipitous samples).

Benchmarks:

Serendipity-2018 (Binary Classification (Serendipitous vs. Non-Serendipitous))

Metrics:

Accuracy
Precision (Macro)
Recall (Macro)
F1-score (Macro)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of LLM classification performance against baselines. Note that 'Random' is a strong baseline due to class imbalance.
Serendipity-2018	Accuracy	0.505	0.876	+0.371
Serendipity-2018	F1-score	0.205	0.528	+0.323
Serendipity-2018	F1-score	0.485	0.528	+0.043
Serendipity-2018	Precision	0.129	0.207	+0.078

Main Takeaways

LLMs outperform random and 'all negative' baselines in Accuracy but struggle with Precision on the minority class (serendipitous items).
Including genres and ratings (Explicit w/ Genres) generally improves performance for GPT-4 compared to item names alone.
The 'unpopularity' metric (unpop) is a competitive heuristic baseline, sometimes outperforming complex LLM prompting in recall.
GPT-4 significantly outperforms GPT-3.5 and Llama2-13B, indicating model capacity is crucial for this nuanced task.

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems basics (Collaborative Filtering, Matrix Factorization)
Large Language Models (In-context learning/Prompting)
Evaluation metrics (Precision, Recall, F1)

Key Terms

_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.

Serendipity: In RS context, a recommendation that is both relevant (useful/liked) and unexpected to the user.

RS: Recommender Systems—algorithms designed to suggest relevant items to users.

LLM4Rec: The application of Large Language Models to Recommender Systems tasks.

SVD: Singular Value Decomposition—a matrix factorization technique used in collaborative filtering to predict missing ratings.

SOG: Serendipity-Oriented Greedy—a re-ranking algorithm designed to improve serendipity by balancing relevance, diversity, and unpopularity.

NLG: Natural Language Generation—producing text outputs from models.

Zero-shot/Few-shot: Providing the model with zero or a few examples in the prompt to guide its behavior without parameter updates.

Macro metrics: Averaging metrics (Precision/Recall) independently per class to treat minority and majority classes equally.