Enhancing Recommendation Diversity by Re-ranking with Large Language Models

📝 Paper Summary

Recommendation Re-ranking, LLM for Recommender Systems

LLMs can re-rank candidate recommendations to improve diversity by following zero-shot prompts, though they currently sacrifice more relevance and cost more to run than traditional greedy re-ranking algorithms.

Core Problem

Recommender systems often produce relevant but homogeneous lists of items, which fail to offer meaningful choice or to hedge against uncertainty about what the user actually wants.

Why it matters:

Pure relevance maximization ignores critical user satisfaction factors like novelty, serendipity, and fairness
Traditional diversity methods (greedy re-ranking) require explicit feature engineering and hyperparameter tuning
Existing LLM-based recommendation research focuses almost exclusively on relevance, neglecting beyond-accuracy objectives like diversity

Concrete Example: A relevance-optimized recommender might suggest 10 very similar 'Action' anime movies to a user. Traditional methods re-rank this list using mathematical formulas to mix genres. This paper tests if an LLM can simply be told 'produce a diverse ranking' and achieve a similar result without explicit feature engineering.

Key Novelty

Zero-Shot LLM-based Diversity Re-ranking

Frames the diversification problem as a text generation task where an LLM re-orders a candidate list based on natural language instructions
Designs specific prompts that guide the LLM to balance relevance and diversity without needing training data or explicit distance metrics
Introduces a methodology to handle LLM hallucinations (invalid items) in recommendation lists during evaluation

Architecture

Figure: Workflow of the proposed LLM-based diversity re-ranking approach.

Evaluation Highlights

ChatGPT improves diversity on the Anime dataset (e.g., +0.060 EILD) compared to the initial relevance-based ranking, at the cost of a drop in relevance (-0.033 nDCG@10)
OpenAI models (ChatGPT, InstructGPT) consistently outperform Llama2 models in following re-ranking instructions and minimizing hallucinations
Feature-aware prompts (providing item genres) yield better trade-offs than simple high-level instructions

Breakthrough Assessment

4/10

First exploration of LLMs for diversity re-ranking. While promising, it does not yet beat traditional, faster greedy baselines, serving more as a feasibility study than a new SOTA.

⚙️ Technical Details

Problem Definition

Setting: Post-processing re-ranking of a candidate recommendation list

Inputs: A candidate list of items C_L generated by a baseline recommender (e.g., Matrix Factorization), ordered by relevance

Outputs: A re-ranked sub-list R_L of size n that aims to balance relevance and diversity

Pipeline Flow

Baseline Recommender (Generates candidate list)
Prompt Construction (Injects candidate list into template)
LLM Inference (Generates re-ranked list text)
Parsing & Validation (Extracts items, handles hallucinations)

System Modules

Baseline Recommender

Generate initial pool of relevant items

Model or implementation: Matrix Factorization (implicit feedback)
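
As a rough illustration of this module, the sketch below scores unseen items by the inner product of user and item latent factors and keeps the top k as the candidate list. The function names and the toy factor values are assumptions made for this example; the paper's actual training procedure and factor dimensions are not described here.

```python
from typing import Dict, List, Sequence, Set


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length latent-factor vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_candidates(
    user_factors: Sequence[float],
    item_factors: Dict[str, Sequence[float]],
    seen: Set[str],
    k: int,
) -> List[str]:
    """Return the k unseen items with the highest predicted score."""
    scores = {
        item: dot(user_factors, factors)
        for item, factors in item_factors.items()
        if item not in seen
    }
    # Sort by descending score; break ties by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


# Toy latent factors, made up for illustration only.
ITEM_FACTORS = {
    "A": [0.9, 0.1],
    "B": [0.8, 0.3],
    "C": [0.1, 0.9],
    "D": [0.5, 0.5],
}
CANDIDATES = top_candidates([1.0, 0.2], ITEM_FACTORS, seen={"A"}, k=2)
```

The resulting list is ordered by predicted relevance, which is the ordering the re-ranker receives as input.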

Prompt Generator (Re-ranking)

Construct the zero-shot prompt including task instructions and the candidate list

Model or implementation: Template-based string formatter
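
A template-based formatter of this kind might look like the sketch below. The instruction wording, the `build_prompt` name, and the genre annotation format are hypothetical and are not the paper's actual templates; the optional `genres` argument mimics the feature-aware variant described in this summary.

```python
from typing import Dict, List, Optional


def format_item(rank: int, title: str, genres: Optional[Dict[str, List[str]]]) -> str:
    """Render one candidate line, optionally annotated with its genres."""
    if genres is None:
        return f"{rank}. {title}"
    tags = ", ".join(genres.get(title, [])) or "unknown"
    return f"{rank}. {title} [genres: {tags}]"


def build_prompt(
    candidates: List[str],
    n: int,
    genres: Optional[Dict[str, List[str]]] = None,
) -> str:
    """Build a zero-shot re-ranking prompt (hypothetical wording)."""
    listing = "\n".join(
        format_item(rank, title, genres)
        for rank, title in enumerate(candidates, start=1)
    )
    instruction = (
        f"The following candidates are ordered by relevance. "
        f"Re-rank them and return a numbered list of the {n} titles that "
        f"best balances relevance and diversity. "
        f"Use only titles from the candidate list."
    )
    return f"{instruction}\n\nCandidates:\n{listing}\n\nRe-ranked list:"


EXAMPLE_PROMPT = build_prompt(
    ["Naruto", "Bleach"], n=2, genres={"Naruto": ["Action"]}
)
```

No example input-output pairs are included in the prompt, which is what makes it zero-shot.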

LLM Re-ranker (Re-ranking)

Generate a new ordering of items emphasizing diversity

Model or implementation: ChatGPT (gpt-3.5-turbo), InstructGPT (text-davinci-003), Llama2-7B-Chat, or Llama2-13B-Chat

Output Parser

Map text back to item IDs and filter invalid recommendations

Model or implementation: Regular expressions / String matching
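
One plausible way to implement this step is sketched below: numbered lines are extracted with a regular expression, titles are matched case-insensitively against the candidate list, and anything that does not match (a hallucinated item) or repeats is dropped. The dropping policy, the `parse_reranked` name, and the line pattern are assumptions for illustration; the paper's exact handling of invalid items may differ.

```python
import re
from typing import Dict, List

# Matches lines such as "1. Title" or "2) Title".
LINE_PATTERN = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")


def normalize(title: str) -> str:
    """Lower-case and collapse whitespace so minor formatting drift still matches."""
    return re.sub(r"\s+", " ", title).strip().lower()


def parse_reranked(text: str, candidates: Dict[str, int], n: int) -> List[int]:
    """Map LLM output back to item IDs, skipping invalid and duplicate items."""
    lookup = {normalize(title): item_id for title, item_id in candidates.items()}
    result: List[int] = []
    for raw_line in text.splitlines():
        match = LINE_PATTERN.match(raw_line)
        if match is None:
            continue  # Not a numbered list entry (e.g., preamble text).
        item_id = lookup.get(normalize(match.group(1)))
        if item_id is None or item_id in result:
            continue  # Hallucinated title or repeated item.
        result.append(item_id)
        if len(result) == n:
            break
    return result
```

Because invalid entries are removed, the parsed list can be shorter than n, which the evaluation then has to account for.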

Novel Architectural Elements

Prompt-based diversity controller: Using natural language instructions to replace the mathematical objective function (lambda parameter) typically used in greedy re-ranking
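
For reference, the objective that the prompt replaces is commonly written in the following form. The symbols rel, div, and f are notation assumed here for illustration rather than quoted from the paper; C_L and R_L are the candidate list and the re-ranked list from the problem definition.

```latex
i^{*} = \arg\max_{i \in C_L \setminus R_L} f(i, R_L),
\qquad
f(i, R_L) = (1 - \lambda)\,\mathrm{rel}(i) + \lambda\,\mathrm{div}(i, R_L),
\quad \lambda \in [0, 1]
```

Setting λ = 0 reproduces the relevance-only ordering, while λ = 1 ignores relevance entirely. A natural language instruction offers no equivalent of this explicit dial.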

Modeling

Base Models (all evaluated): ChatGPT (gpt-3.5-turbo), InstructGPT (text-davinci-003), Llama2-7B-Chat, Llama2-13B-Chat

Training Method: Zero-shot prompting (Inference only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MMR/xQuAD: The LLM relies on semantic understanding and internal knowledge rather than explicit distance metrics (e.g., Jaccard or cosine distance over genres)
vs. MMR/xQuAD: The LLM approach is a 'black box' with respect to the trade-off, whereas greedy methods expose an explicit lambda hyperparameter
vs. standard LLM re-ranking (e.g., Sun et al. 2023): Targets a diversity objective rather than relevance alone (related work, not used as a comparison baseline in the paper)

Limitations

Lower performance than traditional greedy baselines on relevance-aware diversity metrics
High inference latency and cost compared to efficient greedy algorithms
Susceptibility to hallucinations (recommending items not in candidate set)
Sensitivity to prompt phrasing (requires prompt engineering)
Lack of explicit control over the relevance-diversity trade-off (no tunable parameter like lambda)

Reproducibility

Code: https://github.com/diegocarraro/LLM-diversity-reranking

📊 Experiments & Results

Evaluation Setup

Offline evaluation on two datasets (Anime, Books) using a Matrix Factorization candidate generator.

Benchmarks:

MyAnimeList (Anime) (Top-n Recommendation)
LibraryThing (Books) (Top-n Recommendation)

Metrics:

nDCG@10 (Relevance)
EILD (Expected Intra-List Diversity)
alpha-nDCG (Diversity-aware relevance)
S-Recall (Subtopic Recall)
Gini Index (Fairness/Aggregate Diversity)

Statistical methodology: Not explicitly reported in the paper
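
Two of the metrics listed above are simple enough to sketch directly. The code below uses genres as subtopics for S-Recall and one common rank-based formula for the Gini index over item exposure counts; both choices are assumptions for illustration, and the paper's exact definitions (including whether a higher or lower Gini value is reported as better) may differ.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def subtopic_recall(ranked: List[str], genres: Dict[str, Set[str]], all_genres: Set[str]) -> float:
    """Fraction of all genres covered by at least one item in the list."""
    if not all_genres:
        return 0.0
    covered: Set[str] = set()
    for item in ranked:
        covered |= genres[item]
    return len(covered & all_genres) / len(all_genres)


def exposure_counts(lists: List[List[str]], catalog: Sequence[str]) -> List[int]:
    """How often each catalog item appears across all users' lists."""
    counts = Counter(item for one_list in lists for item in one_list)
    return [counts[item] for item in catalog]


def gini_index(counts: Sequence[float]) -> float:
    """Gini coefficient: 0 means equal exposure, values near 1 mean concentration."""
    values = sorted(counts)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


GENRES = {"A": {"Action"}, "B": {"Action", "Comedy"}, "C": {"Drama"}}
ALL_GENRES = {"Action", "Comedy", "Drama", "Horror"}
S_RECALL = subtopic_recall(["A", "B"], GENRES, ALL_GENRES)
GINI = gini_index(exposure_counts([["A", "B"], ["A", "C"]], ["A", "B", "C", "D"]))
```

S-Recall is computed per list, whereas the Gini index is computed once over the lists of all users, which is why it serves as an aggregate measure.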

Key Results

Comparison of LLM re-rankers against the original candidate ranking (no re-ranking) and Random re-ranking on the Anime dataset:

Benchmark	Metric	Baseline	This Paper	Δ
Anime	nDCG@10	0.297	0.264	-0.033
Anime	EILD	0.339	0.399	+0.060
Anime	alpha-nDCG	0.380	0.395	+0.015

Comparison against state-of-the-art greedy re-rankers (MMR, xQuAD), which currently represent the 'ceiling' for performance:

Benchmark	Metric	Baseline	This Paper	Δ
Anime	alpha-nDCG	0.419	0.395	-0.024
Anime	EILD	0.426	0.399	-0.027

Model comparison, OpenAI vs. Llama models:

Benchmark	Metric	Baseline	This Paper	Δ
Anime	alpha-nDCG	0.370	0.395	+0.025
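
To put the absolute deltas in proportion, the relative change against the baseline can be computed from the reported numbers. The sketch below uses only the three values from the comparison with the original candidate ranking; the variable and function names are illustrative.

```python
from typing import Dict, List, Tuple

# (metric, baseline, this paper) for the Anime dataset, as reported above.
ROWS: List[Tuple[str, float, float]] = [
    ("nDCG@10", 0.297, 0.264),
    ("EILD", 0.339, 0.399),
    ("alpha-nDCG", 0.380, 0.395),
]


def relative_change(baseline: float, value: float) -> float:
    """Change relative to the baseline, as a fraction (0.10 means +10%)."""
    return (value - baseline) / baseline


RELATIVE_PERCENT: Dict[str, float] = {
    metric: round(100 * relative_change(baseline, value), 1)
    for metric, baseline, value in ROWS
}
```

On these figures the re-ranker gives up roughly 11% of nDCG@10 in exchange for a gain of roughly 18% in EILD.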

Main Takeaways

LLMs successfully interpret diversity instructions: they increase diversity metrics (EILD, S-Recall) compared to the initial relevance-only ranking.
Trade-off exists: LLM re-ranking reduces relevance (nDCG) to buy diversity, similar to traditional methods, but generally sits 'below' the Pareto frontier of optimized greedy methods like xQuAD.
Prompt quality matters: Templates that explicitly include item features (e.g., genres) help the LLM make better diversity decisions than abstract instructions.
Cost/Latency barrier: Traditional greedy methods take milliseconds; LLMs take seconds and incur token costs, making them currently impractical for real-time production without significant optimization.

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering / Matrix Factorization
Evaluation metrics for Recommender Systems (nDCG, Precision)
Diversity metrics (Intra-List Diversity, alpha-nDCG)

Key Terms

Greedy re-ranking: An iterative process where items are selected one by one to maximize a combined score of relevance and diversity at each step

MMR: Maximal Marginal Relevance—a greedy strategy that selects items maximizing relevance while minimizing similarity to already selected items
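
A minimal MMR-style sketch follows, assuming Jaccard distance over genre sets as the dissimilarity and toy relevance scores. Here `lam` weights diversity, which flips the role λ plays in the original MMR formulation; the data and function names are made up for illustration and are not taken from the paper.

```python
from typing import Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 minus the Jaccard similarity of two genre sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mmr_rerank(
    candidates: List[str],
    relevance: Dict[str, float],
    genres: Dict[str, Set[str]],
    n: int,
    lam: float,
) -> List[str]:
    """Greedily pick n items, trading relevance against distance to picked items."""
    remaining = list(candidates)
    selected: List[str] = []

    def score(item: str) -> float:
        # Distance to the closest already-selected item (0 when nothing is selected).
        min_dist = min(
            (jaccard_distance(genres[item], genres[s]) for s in selected),
            default=0.0,
        )
        return (1 - lam) * relevance[item] + lam * min_dist

    while remaining and len(selected) < n:
        best = max(remaining, key=score)  # Ties go to the earlier candidate.
        selected.append(best)
        remaining.remove(best)
    return selected


RELEVANCE = {"A": 1.0, "B": 0.9, "C": 0.8, "D": 0.5}
GENRES = {"A": {"Action"}, "B": {"Action"}, "C": {"Action", "Comedy"}, "D": {"Drama"}}
DIVERSE = mmr_rerank(["A", "B", "C", "D"], RELEVANCE, GENRES, n=3, lam=0.5)
RELEVANT_ONLY = mmr_rerank(["A", "B", "C", "D"], RELEVANCE, GENRES, n=3, lam=0.0)
```

With `lam=0.5` the second pick jumps to the Drama title despite its lower relevance, whereas `lam=0.0` simply keeps the original order.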

xQuAD: Explicit Query Aspect Diversification—a diversity method that selects items to cover different user interests or item aspects
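
The sketch below is a simplified xQuAD-style greedy step. It assumes that genres act as aspects, that an item either covers a genre or does not (binary coverage), and that a user's interest in each genre is given as a probability; the genre distribution and all names are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Set


def aspect_novelty(
    item: str,
    selected: List[str],
    genres: Dict[str, Set[str]],
    user_genre_prob: Dict[str, float],
) -> float:
    """Sum of user interest in the item's genres that no selected item covers yet."""
    total = 0.0
    for aspect, p_aspect in user_genre_prob.items():
        if aspect not in genres[item]:
            continue
        already_covered = any(aspect in genres[s] for s in selected)
        if not already_covered:
            total += p_aspect
    return total


def xquad_rerank(
    candidates: List[str],
    relevance: Dict[str, float],
    genres: Dict[str, Set[str]],
    user_genre_prob: Dict[str, float],
    n: int,
    lam: float,
) -> List[str]:
    """Greedily pick n items by mixing relevance with coverage of uncovered aspects."""
    remaining = list(candidates)
    selected: List[str] = []
    while remaining and len(selected) < n:
        best = max(
            remaining,
            key=lambda item: (1 - lam) * relevance[item]
            + lam * aspect_novelty(item, selected, genres, user_genre_prob),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


RELEVANCE = {"A": 1.0, "B": 0.9, "C": 0.8, "D": 0.5}
GENRES = {"A": {"Action"}, "B": {"Action"}, "C": {"Action", "Comedy"}, "D": {"Drama"}}
USER_GENRE_PROB = {"Action": 0.6, "Comedy": 0.3, "Drama": 0.1}  # Made-up profile.
XQUAD_BALANCED = xquad_rerank(["A", "B", "C", "D"], RELEVANCE, GENRES, USER_GENRE_PROB, n=3, lam=0.5)
XQUAD_DIVERSE = xquad_rerank(["A", "B", "C", "D"], RELEVANCE, GENRES, USER_GENRE_PROB, n=3, lam=0.9)
```

Unlike a pure distance-based method, the aspect weights tie diversification to the user's own interests: at `lam=0.5` the low-interest Drama title is left out, and it only enters the list at `lam=0.9`.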

nDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that weights highly relevant items more when they appear earlier in the list
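
A small sketch of nDCG@k under the assumption of binary relevance (an item either is in the user's held-out set or is not). The paper may use graded relevance or a different gain function, so treat this as illustrative.

```python
import math
from typing import List, Set


def dcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Discounted cumulative gain with binary gains and a log2 rank discount."""
    return sum(
        1.0 / math.log2(position + 1)
        for position, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """DCG divided by the DCG of an ideal ranking that lists relevant items first."""
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    if ideal == 0.0:
        return 0.0
    return dcg_at_k(ranked, relevant, k) / ideal


# Relevant items at ranks 1 and 3: DCG = 1 + 1/log2(4) = 1.5.
EXAMPLE_NDCG = ndcg_at_k(["a", "b", "c"], {"a", "c"}, k=3)
```

Moving a relevant item further down the list lowers the score, which is why diversity re-ranking that demotes relevant items shows up as an nDCG drop.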

EILD: Expected Intra-List Diversity—a metric measuring the average pairwise distance between recommended items, weighted by their rank and relevance
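
The sketch below computes plain intra-list diversity as the unweighted average pairwise Jaccard distance over genre sets. It deliberately omits the rank and relevance weighting that distinguishes EILD, so it is a simplification for illustration rather than the metric reported in the paper.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 minus the Jaccard similarity of two genre sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def intra_list_diversity(items: List[str], genres: Dict[str, Set[str]]) -> float:
    """Average distance over all unordered pairs of items in the list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(genres[a], genres[b]) for a, b in pairs) / len(pairs)


GENRES = {"A": {"Action"}, "B": {"Action"}, "C": {"Drama"}}
HOMOGENEOUS = intra_list_diversity(["A", "B"], GENRES)
MIXED = intra_list_diversity(["A", "B", "C"], GENRES)
```

A list of items with identical genres scores 0, and adding one item from a different genre raises the average.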

Zero-shot prompting: Asking a model to perform a task without providing any example inputs and outputs in the prompt

Hallucination (in RS): When the LLM recommends items that were not in the candidate list or do not exist

Matrix Factorization: A technique that decomposes the user-item interaction matrix into lower-dimensional latent factors to predict missing ratings or interactions