Exploring the Potential of LLMs for Serendipity Evaluation in Recommender Systems

📝 Paper Summary

LLM-based Evaluation
Serendipity in Recommender Systems

LLMs, especially when enhanced with auxiliary user data and multi-agent strategies, can simulate user serendipity judgments more accurately than traditional proxy metrics in recommender systems.

Core Problem

Evaluating serendipity (unexpected relevance) is difficult because it is inherently subjective, and gold-standard user studies are costly while algorithmic proxy metrics often fail to align with real user perception.

Why it matters:

Serendipity is crucial for user satisfaction because it breaks filter bubbles, but it is hard to measure at scale
Current proxy metrics rely on fixed assumptions (e.g., popularity, diversity) that diverge significantly from how users actually perceive recommendations
Existing research lacks a systematic validation of whether LLMs can effectively replace human annotators for this specific subjective metric

Concrete Example: Traditional metrics like SNPR heavily weight relevance; if a serendipitous item is unexpected (low relevance score), SNPR penalizes it, whereas a human user might rate it highly as a 'pleasant surprise'. LLMs need to capture this nuance.

Key Novelty

SerenEva (Serendipity Evaluation Framework)

Benchmarking LLMs directly against real user study data to validate their capability as 'user simulators' for serendipity ratings
Injecting specific auxiliary data (e.g., user curiosity personality traits, item similarity) into prompts to better model the subjective nature of serendipity
Using multi-LLM voting strategies to reduce variance and improve alignment with human labels

Architecture

The SerenEva framework workflow

Evaluation Highlights

LLMs (e.g., Qwen2.5-14B) surpass the best conventional proxy metric (SOG) by ~100% in Pearson correlation in zero-shot settings
Optimal LLM configuration achieves a Pearson correlation above 0.20 with human user study labels, establishing a new state-of-the-art for automated evaluation
Small models (Qwen2.5-7B) in few-shot settings can approach the performance of large models (72B) in zero-shot settings

Breakthrough Assessment

7/10

Strong empirical validation of LLMs as superior evaluators for a subjective metric. While not a new model architecture, it establishes a reliable paradigm for evaluating serendipity without expensive user studies.

⚙️ Technical Details

Problem Definition

Setting: Given a user u, their history H, and a recommended item i, predict the serendipity score s(i, u) on a 5-point Likert scale.

Inputs: User history (list of items), recommended item metadata, optional auxiliary data (demographics, personality traits)

Outputs: Integer rating (1-5) representing the level of serendipity (pleasant surprise)

Pipeline Flow

Data Preparation (User History + Item Metadata)
Prompt Construction (incorporating auxiliary data)
LLM Inference (User Simulation)
Score Extraction & Scaling
Meta-Evaluation (Comparison with Human Ground Truth)
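
The first four steps above can be sketched as a small chain of functions. This is a minimal illustration, not the paper's implementation: the prompt wording, the function names (`build_prompt`, `extract_score`, `simulate_rating`), and the stub `fake_llm` are all assumptions made so the sketch runs without a model.

```python
import re
from typing import Callable, Optional, Sequence


def build_prompt(history: Sequence[str], item: str) -> str:
    """Steps 1-2: turn user history and item metadata into a prompt.

    The wording is hypothetical, not taken from the paper.
    """
    seen = "; ".join(history)
    return (
        f"You previously interacted with: {seen}.\n"
        f"You are now recommended: {item}.\n"
        "How serendipitous (pleasantly surprising) is this item? "
        "Answer with one integer from 1 to 5."
    )


def extract_score(reply: str) -> Optional[int]:
    """Step 4: return the first standalone digit 1-5 in the reply, else None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def simulate_rating(
    history: Sequence[str], item: str, llm: Callable[[str], str]
) -> Optional[int]:
    """Steps 1-4 chained; `llm` is any callable mapping a prompt to text."""
    return extract_score(llm(build_prompt(history, item)))


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call (step 3)."""
    return "Rating: 4. The item is outside my usual taste but appealing."


rating = simulate_rating(["running shoes", "yoga mat"], "film camera", fake_llm)
print(rating)
```

Passing the model as a plain callable keeps the sketch independent of any particular inference API; a real client would replace `fake_llm`.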

System Modules

Prompt Constructor

Formats user history and item data into a persona-based prompt

Model or implementation: N/A (Prompt Engineering)
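
As a rough sketch of what such a constructor might look like, the function below assembles a persona-style prompt with optional auxiliary traits and optional few-shot examples. The template text, the `auxiliary` dictionary format, and the names `construct_prompt` and `DEFINITION` are assumptions for illustration; the paper's actual templates are in its codebase.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Hypothetical wording of an explicit serendipity definition.
DEFINITION = (
    "Serendipity means the item is both unexpected and relevant to you, "
    "i.e. a pleasant surprise."
)


def construct_prompt(
    history: Sequence[str],
    item: str,
    auxiliary: Optional[Mapping[str, str]] = None,
    examples: Sequence[Tuple[str, int]] = (),
) -> str:
    """Build a persona-based rating prompt.

    auxiliary: optional traits such as {"curiosity": "high"} (assumed format).
    examples:  optional (description, rating) pairs; empty means zero-shot.
    """
    lines = ["Imagine you are a user who interacted with these items:"]
    lines += [f"- {h}" for h in history]
    if auxiliary:
        lines.append("Additional information about you:")
        lines += [f"- {key}: {value}" for key, value in auxiliary.items()]
    lines.append(DEFINITION)
    for description, rating in examples:
        lines.append(f"Example: {description} -> rating {rating}")
    lines.append(f"Recommended item: {item}")
    lines.append(
        "Rate its serendipity from 1 (not at all) to 5 (very). "
        "Reply with one integer."
    )
    return "\n".join(lines)


zero_shot = construct_prompt(["running shoes", "yoga mat"], "film camera")
few_shot = construct_prompt(
    ["running shoes", "yoga mat"],
    "film camera",
    auxiliary={"curiosity": "high"},
    examples=[("a user who buys novels is shown a telescope", 4)],
)
print(few_shot)
```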

User Simulator

Predicts a serendipity rating (1-5) based on the prompt

Model or implementation: Various (Qwen2.5, GPT-4, LLaMA2)

Score Normalizer

Maps raw model/metric outputs to the 5-point Likert scale for fair comparison

Model or implementation: Rule-based
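
The summary does not state which rule is used. One plausible choice, shown here purely as an assumption, is linear min-max scaling of a metric's raw outputs onto the interval [1, 5]; the function name `to_likert` is hypothetical.

```python
from typing import List, Sequence


def to_likert(raw: Sequence[float], low: float = 1.0, high: float = 5.0) -> List[float]:
    """Linearly rescale raw scores so the minimum maps to `low` and the maximum to `high`.

    Assumed rule (min-max scaling); the paper's exact mapping may differ.
    """
    lo, hi = min(raw), max(raw)
    if hi == lo:
        # All scores identical: no spread to rescale, so use the scale midpoint.
        return [(low + high) / 2 for _ in raw]
    return [low + (x - lo) / (hi - lo) * (high - low) for x in raw]


scaled = to_likert([0.0, 0.5, 1.0])
print(scaled)
```

Keeping the scaled values continuous preserves the ranking information that Pearson correlation uses; rounding to integers would be a further design choice.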

Novel Architectural Elements

Integration of psychological auxiliary data (Curiosity, Personality Traits) directly into the evaluation prompt context to model subjective serendipity

Modeling

Base Model: Qwen2.5 (7B, 14B, 32B, 72B), GPT-4, LLaMA2-13B

Compute: Inference only. Low temperature setting (0.00001). Results averaged over five runs.

Comparison to Prior Work

vs. SOG/SNPR/PURS/DESR: Uses LLMs as semantic user simulators rather than rigid mathematical formulas
vs. SerenPrompt: SerenEva is designed for evaluation (user simulation) rather than recommendation optimization; outperforms SerenPrompt in alignment with human ratings
vs. LLM4Seren: SerenEva includes explicit definitions and auxiliary data injection, leading to better performance
vs. GPT-Eval [not cited in paper]: General purpose evaluator, whereas SerenEva focuses specifically on the subjective metric of serendipity with domain-specific auxiliary data

Limitations

LLaMA family models performed poorly on the Chinese-language dataset (Taobao) due to language barriers
Optimal auxiliary data types are domain-dependent (e.g., curiosity works better for e-commerce than movies)
Evaluation is limited to offline datasets; online A/B testing impact is not measured
Cost of using proprietary LLMs (like GPT-4) for large-scale evaluation can be high compared to proxy metrics

Reproducibility

Code: https://github.com/Leah-HKBU/SerenEva

The code is publicly available at the repository above, with prompt templates included in the codebase. Experiments use public datasets (Taobao Serendipity, Serendipity-2018), and the open-source models (Qwen, LLaMA) are standard HuggingFace releases.

📊 Experiments & Results

Evaluation Setup

Meta-evaluation comparing evaluator predictions against human ground truth ratings

Benchmarks:

Taobao Serendipity (E-commerce serendipity rating)
Serendipity-2018 (MovieLens) (Movie recommendation serendipity rating)

Metrics:

Pearson Correlation Coefficient
Mean Absolute Error (MAE)
Root Mean Squared Error (RMSE)
Statistical methodology: Two-sided t-test with p<0.05
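
For reference, the three agreement metrics can be computed with a few lines of standard-library Python. The ratings below are toy values invented for illustration, not data from the paper, and the significance test is omitted.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def mae(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def rmse(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))


# Toy ratings (hypothetical): human labels vs. evaluator predictions.
human = [1, 2, 3, 4, 5]
predicted = [2, 2, 3, 5, 5]

print(f"Pearson={pearson(predicted, human):.4f}")
print(f"MAE={mae(predicted, human):.4f}")
print(f"RMSE={rmse(predicted, human):.4f}")
```

Pearson rewards getting the ordering and relative spacing right, while MAE and RMSE measure how far the predicted ratings are from the human ones in absolute terms, so an evaluator can score well on one and poorly on the others.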

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of LLMs against conventional proxy metrics (SOG, SNPR, etc.) on Taobao dataset.
Taobao Serendipity	Pearson Correlation	0.0521	0.1050	+0.0529
Taobao Serendipity	Pearson Correlation	0.0521	0.1384	+0.0863
Impact of using auxiliary data and multi-LLM techniques.
Taobao Serendipity	Pearson Correlation	0.1384	0.2150	+0.0766
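
To put the absolute differences in the table into relative terms, the short calculation below recomputes each Δ and expresses it as a percentage of the baseline. The row labels are descriptive placeholders, since the table does not name the configuration behind each row.

```python
def relative_gain(baseline: float, ours: float) -> float:
    """Improvement over the baseline, as a percentage of the baseline."""
    return 100.0 * (ours - baseline) / baseline


# (label, baseline Pearson, this-paper Pearson), copied from the table rows.
rows = [
    ("LLM vs. proxy metric, row 1", 0.0521, 0.1050),
    ("LLM vs. proxy metric, row 2", 0.0521, 0.1384),
    ("Auxiliary data + multi-LLM", 0.1384, 0.2150),
]

for label, baseline, ours in rows:
    delta = ours - baseline
    print(f"{label}: delta={delta:+.4f} ({relative_gain(baseline, ours):+.1f}%)")
```

The first row works out to roughly a doubling of the baseline correlation, which is consistent with the "~100%" figure quoted in the highlights.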

Main Takeaways

Zero-shot LLMs perform comparably to or better than the best conventional proxy metrics (SOG) in aligning with human serendipity judgments.
Auxiliary data improves performance, but effectiveness is domain-dependent: 'Curiosity' helps in e-commerce (Taobao) but 'Openness' is less effective there; 'Similarity' helps in Movies.
Multi-LLM collaboration (averaging scores from multiple models) consistently reduces variance and improves correlation with human ratings.
Small models (7B) are viable: Qwen2.5-7B with few-shot examples approaches the performance of much larger models, offering a cost-effective evaluation path.
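
Multi-LLM collaboration can be illustrated with two simple aggregation rules: averaging per-item scores across models, and majority voting. The sketch below is an assumption about how such aggregation could be written; the function names, the tie-breaking rule, and the toy ratings are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import List, Sequence


def average_scores(ratings_per_model: Sequence[Sequence[int]]) -> List[float]:
    """Average the ratings of all models for each item (lists aligned by item)."""
    return [mean(item_ratings) for item_ratings in zip(*ratings_per_model)]


def majority_vote(ratings_per_model: Sequence[Sequence[int]]) -> List[int]:
    """Pick the most frequent rating per item.

    Ties are broken by taking the lowest tied rating (an arbitrary assumption).
    """
    voted = []
    for item_ratings in zip(*ratings_per_model):
        counts = Counter(item_ratings)
        top = max(counts.values())
        voted.append(min(r for r, c in counts.items() if c == top))
    return voted


# Toy ratings from three hypothetical models for the same three items.
model_a = [4, 2, 5]
model_b = [5, 2, 3]
model_c = [3, 3, 3]

averaged = average_scores([model_a, model_b, model_c])
voted = majority_vote([model_a, model_b, model_c])
print(averaged)
print(voted)
```

Averaging yields continuous scores, which tends to suit correlation-based meta-evaluation, whereas voting keeps the output on the integer Likert scale.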

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems metrics (NDCG, Serendipity)
Large Language Models (Prompting, Few-shot learning)
Pearson Correlation Coefficient

Key Terms

Serendipity: The quality of a recommendation that is both unexpected and relevant to the user (a pleasant surprise)

Proxy metrics: Algorithmic formulas (e.g., SOG, SNPR) used to estimate serendipity when human feedback is unavailable

Zero-shot: Prompting the LLM to perform a task without providing any specific examples of that task in the context

Few-shot: Prompting the LLM with a small set of example inputs and outputs to guide its generation

SOG: Serendipity-Oriented Greedy—a baseline proxy metric combining relevance, diversity, and unpopularity

SNPR: Serendipity-oriented Next POI Recommendation—a baseline metric emphasizing relevance and unexpectedness

SerenEva: The proposed meta-evaluation framework for assessing how well evaluators align with human serendipity judgments

Pearson correlation: A statistic measuring linear correlation between two sets of data (here, predicted scores vs. human ratings)

MAE: Mean Absolute Error—average of absolute differences between prediction and ground truth

RMSE: Root Mean Squared Error, the square root of the average squared difference between prediction and ground truth; it penalizes large errors more heavily than MAE
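
For completeness, the three statistics above have the following standard definitions. The notation is an assumption introduced here: $\hat{s}_k$ is the evaluator's predicted score for the $k$-th user-item pair, $s_k$ is the corresponding human rating, $n$ is the number of pairs, and bars denote means over all pairs.

```latex
r = \frac{\sum_{k=1}^{n} (\hat{s}_k - \bar{\hat{s}})(s_k - \bar{s})}
         {\sqrt{\sum_{k=1}^{n} (\hat{s}_k - \bar{\hat{s}})^2}\,
          \sqrt{\sum_{k=1}^{n} (s_k - \bar{s})^2}}
\qquad
\mathrm{MAE} = \frac{1}{n} \sum_{k=1}^{n} \lvert \hat{s}_k - s_k \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} (\hat{s}_k - s_k)^2}
```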