RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models

📝 Paper Summary

Recommender Systems Evaluation, LLM-as-a-Judge

RecSys Arena utilizes large language models to simulate users and perform pair-wise comparative evaluations of recommender systems, offering fine-grained feedback that aligns better with user preferences than traditional point-wise metrics.

Core Problem

Traditional offline evaluation metrics (like AUC) often fail to capture subtle differences in user satisfaction and are inconsistent with online A/B test results, while online testing is risky and slow.

Why it matters:

Offline metrics like AUC are not sufficiently sensitive to distinguish the real quality of competitive recommender systems
Mainstream metrics such as CTR do not fully reflect long-term user experience or satisfaction
Existing LLM-based evaluations mostly focus on absolute point-wise scoring, which lacks the context needed for granular comparison

Concrete Example: Two competitive recommender systems might have nearly identical AUC scores on a dataset, yet one might offer significantly more diverse or serendipitous items that a user would prefer. Standard metrics miss this nuance, whereas a pair-wise LLM judge can articulate the preference based on the user profile.

Key Novelty

LLM-based Pair-wise Relative Evaluation for RecSys

Instead of asking an LLM to score a single recommendation list (point-wise), the system presents two lists side-by-side to an LLM simulator playing the role of a specific user
Leverages the LLM's reasoning and role-play capabilities to pick a winner between two models based on the user's profile and history, similar to Chatbot Arena but for recommendations

Architecture

Overview of RecSys Arena methodology

Evaluation Highlights

LLM-based pair-wise evaluation results align with trends observed in offline metrics like AUC and Diversity when comparing recommendation models
Proposed method effectively distinguishes between recommendation algorithms that have comparable performance in terms of traditional AUC and nDCG metrics
Evaluation on MovieLens and MIND datasets confirms that larger LLMs generally provide better evaluation effectiveness

Breakthrough Assessment

7/10

Applies the successful 'Arena' concept from NLP to RecSys. While the underlying idea of using LLMs for evaluation isn't new, the specific pair-wise framework for ranking recommendation lists addresses a key limitation of point-wise scoring.

⚙️ Technical Details

Problem Definition

Setting: Pair-wise comparison of two recommender system outputs given a specific user profile and history

Inputs: User attribute information S, viewing history H, and two recommendation lists I_RA and I_RB from systems RA and RB

Outputs: A judgment of which list is better (Win/Tie/Lose), qualitative analysis, and scores across 6 specific dimensions
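The inputs and outputs above can be written down as plain data containers. This is a minimal sketch: the class and field names are hypothetical, and the six scoring dimensions are left as a free-form dictionary because this summary names only some of them ('Inspiration', 'Transparency', 'Impact on users').

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    """Outcome from the point of view of system RA (hypothetical naming)."""
    WIN = "win"    # the list from RA is preferred
    TIE = "tie"    # no preference
    LOSE = "lose"  # the list from RB is preferred


@dataclass
class ComparisonInput:
    user_attributes: Dict[str, str]  # S: user attribute information
    history: List[str]               # H: viewing history
    list_a: List[str]                # I_RA: recommendations from system RA
    list_b: List[str]                # I_RB: recommendations from system RB


@dataclass
class ComparisonOutput:
    verdict: Verdict
    analysis: str  # the judge's qualitative explanation
    # Dimension name -> score; the keys are whatever the evaluation prompt asks for.
    dimension_scores: Dict[str, float] = field(default_factory=dict)


example_in = ComparisonInput(
    user_attributes={"age": "25"},
    history=["Heat"],
    list_a=["Ronin"],
    list_b=["Alien"],
)
example_out = ComparisonOutput(Verdict.WIN, "Prefers crime thrillers.")
```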

Pipeline Flow

User Simulation Setup: Extract user profiles and history
Prompt Construction: Combine user data, two recommendation lists, and evaluation criteria
LLM Inference: LLM generates analysis and judgment
Result Aggregation: Calculate win rates (Quantile metric)
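The four steps above can be sketched end to end as follows. Everything here is an illustrative assumption rather than the paper's implementation: the dictionary keys, the prompt wording, and in particular `toy_judge`, a deterministic stand-in for the LLM call that simply counts how often the user's preferred genre appears in each list.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical judge interface: takes a prompt, returns "A", "B" or "Tie".
Judge = Callable[[str], str]


def run_arena(users: List[Dict], judge: Judge) -> Counter:
    """Run one pair-wise comparison per user and tally the verdicts."""
    tally = Counter()
    for user in users:
        # Steps 1 and 2: turn the user data and both lists into a prompt.
        prompt = (
            f"Profile: {user['profile']}\n"
            f"History: {', '.join(user['history'])}\n"
            f"List A: {', '.join(user['list_a'])}\n"
            f"List B: {', '.join(user['list_b'])}\n"
            "Which list do you prefer? Answer A, B or Tie."
        )
        # Step 3: ask the judge (an LLM in the real setting).
        verdict = judge(prompt)
        # Step 4: aggregate.
        tally[verdict] += 1
    return tally


def toy_judge(prompt: str) -> str:
    """Stand-in for an LLM: prefers the list mentioning the liked genre more often."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    genre = fields["Profile"].split()[-1]
    count_a = fields["List A"].count(genre)
    count_b = fields["List B"].count(genre)
    if count_a == count_b:
        return "Tie"
    return "A" if count_a > count_b else "B"


users = [
    {"profile": "likes sci-fi", "history": ["Blade Runner (sci-fi)"],
     "list_a": ["Alien (sci-fi)", "Heat (crime)"],
     "list_b": ["Ronin (crime)", "Casino (crime)"]},
    {"profile": "likes crime", "history": ["Heat (crime)"],
     "list_a": ["Ronin (crime)"], "list_b": ["Casino (crime)"]},
]
tally = run_arena(users, toy_judge)
```

Swapping `toy_judge` for a function that calls a real model leaves the rest of the loop unchanged, which is the main point of keeping the judge behind a plain callable.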

System Modules

User Profiler (Input Processing)

Construct textual representation of user interests

Model or implementation: Rules/Template-based extraction

Prompt Constructor (Input Processing)

Integrate all context into a structured prompt for the LLM

Model or implementation: Template-based
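A template-based constructor of this kind might look like the sketch below. The wording of the template, the function name `build_prompt`, and the answer format are assumptions made for illustration; the paper's actual template is not reproduced in this summary.

```python
from typing import Dict, List


def build_prompt(user_attributes: Dict[str, str], history: List[str],
                 list_a: List[str], list_b: List[str],
                 criteria: List[str]) -> str:
    """Assemble a role-play prompt that shows both lists side by side."""

    def numbered(items: List[str]) -> str:
        return "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))

    profile = ", ".join(f"{key}: {value}" for key, value in user_attributes.items())
    sections = [
        f"You are a user with the following profile: {profile}",
        "Your viewing history:\n" + numbered(history),
        "Recommendation list A:\n" + numbered(list_a),
        "Recommendation list B:\n" + numbered(list_b),
        "Evaluation criteria: " + ", ".join(criteria),
        "Explain your reasoning, then answer with exactly one of: A, B, Tie.",
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    {"age": "25", "occupation": "student"},
    ["Heat", "Ronin"],
    ["Casino", "Collateral"],
    ["Alien", "Arrival"],
    ["Inspiration", "Transparency"],
)
```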

LLM Judge

Simulate user and evaluate relative preference between lists

Model or implementation: Various LLMs (e.g., GPT-4, Llama-3, etc.)

Novel Architectural Elements

RecSys Arena framework: A prompt-engineering architecture that combines user role-playing with simultaneous pair-wise comparison of full recommendation lists (list against list) rather than of single items.

Modeling

Base Model: Various open and closed source models (8B to 236B parameters)

Comparison to Prior Work

vs. Chatbot Arena: Adapts the arena concept to Recommender Systems using simulated users instead of human crowds
vs. Zhang et al.: Uses pair-wise relative ranking instead of absolute point-wise scoring to improve sensitivity
vs. Wang et al.: Focuses on comparative evaluation of recommendation lists rather than conversational interaction
vs. Traditional Metrics (AUC/nDCG): Provides qualitative reasoning (

Limitations

Dependence on LLM capabilities; smaller models may struggle with reasoning
Potential position bias in pair-wise evaluation (though often mitigated by swapping order)
Inference cost of using large LLMs for evaluation on large datasets
Limited to offline evaluation; while it aims to proxy online preferences, it is still a simulation
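The order-swapping mitigation for position bias mentioned above can be sketched as follows. The judge signature and the rule that disagreeing verdicts collapse to a tie are assumptions for illustration, not details confirmed by this summary.

```python
from typing import Callable, List

# Hypothetical judge: sees two lists in presentation order and returns
# "first", "second" or "tie".
PairJudge = Callable[[List[str], List[str]], str]


def debiased_verdict(list_a: List[str], list_b: List[str], judge: PairJudge) -> str:
    """Query the judge in both presentation orders and keep only agreeing verdicts."""
    forward = judge(list_a, list_b)   # A shown first
    backward = judge(list_b, list_a)  # B shown first

    # Translate position-based answers back to system labels.
    forward_label = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_label = {"first": "B", "second": "A", "tie": "tie"}[backward]

    # Assumption: if the winner changes with the order, treat the pair as a tie.
    return forward_label if forward_label == backward_label else "tie"


def always_first(first: List[str], second: List[str]) -> str:
    """A maximally position-biased judge, used only to exercise the logic."""
    return "first"


def prefers_longer(first: List[str], second: List[str]) -> str:
    """An order-independent toy judge that prefers the longer list."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With `always_first`, the two orders contradict each other and the result is a tie; with `prefers_longer`, both orders agree and the preferred system is returned. The cost is two judge calls per comparison.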

Reproducibility

Code: https://github.com/anonyProjects/RecSys-Arena

📊 Experiments & Results

Evaluation Setup

Offline evaluation on public datasets using LLMs as judges

Benchmarks:

MovieLens (Movie Recommendation)
MIND (News Recommendation)

Metrics:

Quantile Q (Degree of victory/Win rate)
Consistency with offline metrics (AUC, Diversity)
Statistical methodology: Not explicitly reported in the paper
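This summary does not give the formula for the Quantile Q metric, so the sketch below shows only a plain win rate over per-user verdicts, with ties counted as half a win. Both the function name and the tie handling are assumptions and should not be read as the paper's definition of Q.

```python
from typing import Iterable


def win_rate(verdicts: Iterable[str]) -> float:
    """Share of comparisons won by system RA; a tie counts as half a win (assumption)."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one verdict")
    wins = verdicts.count("win")
    ties = verdicts.count("tie")
    return (wins + 0.5 * ties) / len(verdicts)


rate = win_rate(["win", "win", "tie", "lose"])  # (2 + 0.5) / 4
```

A value near 0.5 means the judge sees the two systems as roughly equal; values near 0 or 1 indicate a clear preference.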

Key Results

Specific numeric tables were not provided in the text for extraction, so the results below summarize the findings qualitatively.

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens / MIND	Alignment with AUC/Diversity	Not reported in the paper	Not reported in the paper	Not reported in the paper
MovieLens / MIND	Evaluation Effectiveness	Not reported in the paper	Larger LLMs (236B)	Not reported in the paper

Experiment Figures

The specific prompt template used for evaluation

Main Takeaways

LLMs can generate reasonable pair-wise evaluations that align with traditional offline metrics like AUC and Diversity.
Pair-wise evaluation discriminates better than point-wise metrics: it can distinguish between models with very similar AUC scores.
Qualitative feedback from LLMs provides actionable insights (e.g., regarding 'Inspiration' or 'Transparency') that numeric metrics cannot capture.
The framework supports evaluating dimensions usually hard to measure offline, such as 'Impact on users' and 'Inspiration'.

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems (Basic concepts like CTR, AUC)
Large Language Models (In-context learning, Role-playing)
Evaluation Methodologies (A/B testing, Offline metrics)

Key Terms

RecSys: Recommender Systems—algorithms designed to suggest relevant items to users

AUC: Area Under the ROC Curve—a standard offline metric measuring the probability that a random positive sample is ranked higher than a random negative one
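The probabilistic reading of AUC can be computed directly by enumerating positive/negative pairs. This is a small sketch for intuition (quadratic in the number of samples), with tied scores counted as half, which is the usual convention.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half
    return correct / (len(positives) * len(negatives))


# Positives score 0.9 and 0.4, negatives 0.8 and 0.1: 3 of 4 pairs are ordered correctly.
example_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

Because AUC only counts correctly ordered pairs, two systems can reach the same value while recommending quite different items, which is the gap the pair-wise LLM judge is meant to expose.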

nDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that takes into account the position of relevant items
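A minimal sketch of nDCG, assuming the linear-gain form of DCG with a log2 position discount (a variant with gain 2^rel - 1 is also in common use). The input is the list of relevance labels in the order the system ranked the items.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking, at cutoff k."""
    ranked = list(relevances)[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg(ranked) / ideal_dcg


perfect = ndcg([1, 0, 0])  # relevant item ranked first
late = ndcg([0, 0, 1])     # same item ranked third: 1 / log2(4)
```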

CTR: Click-Through Rate—the ratio of users who click on a specific link to the number of total users who view a page, email, or advertisement

Pair-wise evaluation: Comparing two items or lists directly against each other to determine which is better, rather than scoring them individually

Point-wise scoring: Assigning an absolute score to a single item or list in isolation

Zero-shot: Using a model to perform a task without providing any specific training examples for that task

Chain-of-Thought: A prompting technique that encourages the model to generate intermediate reasoning steps before producing the final answer