
Do LLM-judges Align with Human Relevance in Cranfield-style Recommender Evaluation?

Gustavo Penha, Aleksandr V. Petrov, Claudia Hauff, Enrico Palumbo, Ali Vardasbi, Edoardo D'Amico, Francesco Fabbri, Alice Wang, Praveen Chandar, Henrik Lindstrom, Hugues Bouchard, Mounia Lalmas
Spotify
arXiv (2025)
Recommendation Benchmark P13N

📝 Paper Summary

Recommender Systems Evaluation LLM-as-a-Judge
Large Language Models can serve as reliable, scalable relevance judges for recommender systems, achieving high ranking agreement with humans in rigorous Cranfield-style setups where traditional historical splits fail.
Core Problem
Standard offline recommender evaluation using historical interaction splits suffers from severe sparsity (incomplete labels) and biases, while robust Cranfield-style human annotation is prohibitively expensive.
Why it matters:
  • Traditional train-test splits on historical logs yield unstable results due to exposure and popularity bias
  • Incomplete relevance labels (missing-not-at-random) mean valid recommendations are often penalized as errors
  • Creating high-quality 'gold standard' test collections like those in Information Retrieval costs thousands of dollars per dataset
Concrete Example: When recommender models are evaluated on the ML-32M dataset using a standard 80-20 time-based split, fewer than 15% of the top-100 recommended items have relevance labels (Judged@100 < 15%). This makes it impossible to tell whether a model is genuinely weak or is simply surfacing good items that were never rated.
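To make the Judged@100 figure concrete, here is a minimal sketch of how such a label-coverage metric could be computed for one user. The function name `judged_at_k`, the toy item IDs, and the choice to divide by the number of items actually returned are assumptions made for illustration, not details taken from the paper.

```python
def judged_at_k(ranked_items, judged_items, k=100):
    """Fraction of the top-k recommended items that carry any relevance label.

    Assumption: the denominator is the number of items actually returned
    (at most k), so short lists are not penalized.
    """
    top_k = ranked_items[:k]
    if not top_k:
        return 0.0
    labelled = sum(1 for item in top_k if item in judged_items)
    return labelled / len(top_k)


# Hypothetical toy data: five recommendations, only one of which was ever rated.
recommended = ["m1", "m2", "m3", "m4", "m5"]
items_with_labels = {"m2", "m9"}

coverage = judged_at_k(recommended, items_with_labels, k=5)
print(f"Judged@5 = {coverage:.2f}")
```

A low value means most recommended items are unlabelled, so a metric that treats unlabelled items as non-relevant may understate a model's quality.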
Key Novelty
LLM-based Cranfield Evaluation for Recommendation
  • Adapts the Information Retrieval 'Cranfield paradigm' (pooling top results from many systems for exhaustive judgment) to Recommender Systems using LLMs instead of humans
  • Replaces expensive human assessors with a zero-shot LLM that considers long user interaction histories and rich item metadata to predict subjective preference
  • Demonstrates that LLM judges can replicate human-derived system rankings better than historical data splits can
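The pooling step described above can be sketched as follows. This is an illustrative outline only: the `build_pool` helper, the run format (system name mapped to per-user rankings), and the pool depth are hypothetical, and the judging step itself (human or LLM) is left out.

```python
def build_pool(runs, depth=100):
    """Collect, per user, the union of the top-`depth` items from every system.

    `runs` maps system name -> {user id -> ranked list of item ids}.
    Every pooled (user, item) pair would then be sent to a judge for labelling.
    """
    pool = {}
    for per_user_rankings in runs.values():
        for user, ranking in per_user_rankings.items():
            pool.setdefault(user, set()).update(ranking[:depth])
    return pool


# Hypothetical runs from two systems for two users.
runs = {
    "sys_a": {"u1": ["m1", "m2", "m3"], "u2": ["m4", "m5"]},
    "sys_b": {"u1": ["m2", "m6", "m7"], "u2": ["m5", "m8"]},
}

pool = build_pool(runs, depth=2)
for user in sorted(pool):
    print(user, sorted(pool[user]))
```

Because every item in each participating system's top-`depth` list is in the pool, judging the whole pool gives those systems complete labels down to that depth.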
Evaluation Highlights
  • LLM-judge achieves 0.87 Kendall’s τ correlation with human-based system rankings, comparable to agreement levels in text retrieval tasks
  • Traditional historical train-test splits show poor agreement with human-derived rankings (Kendall’s τ = 0.33), highlighting their unreliability
  • Cranfield-style pooling provides ~100% label completeness (Judged@100) for participating models, compared to <15% for historical splits
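The Kendall's τ values above measure how similarly two evaluation methods order the same set of systems. A minimal sketch of that comparison is shown below, using the simple tau-a variant without tie correction. The system names and scores are invented for illustration and are not results from the paper, and the paper may use a different τ variant.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} dicts (no tie correction)."""
    systems = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        agreement = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical per-system effectiveness scores under two sets of labels.
human_scores = {"sys_a": 0.30, "sys_b": 0.25, "sys_c": 0.20, "sys_d": 0.10}
llm_scores = {"sys_a": 0.28, "sys_b": 0.29, "sys_c": 0.18, "sys_d": 0.12}

tau = kendall_tau(human_scores, llm_scores)
print(f"Kendall's tau = {tau:.2f}")
```

A value of 1 means both label sources rank the systems identically, 0 means no association, and -1 means the rankings are reversed.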
Breakthrough Assessment
7/10
Strong empirical validation of LLM-judges in a domain (recommendation) known for subjectivity, offering a scalable alternative to broken offline evaluation methodologies.