← Back to Paper List

Large Language Models as Evaluators for Recommendation Explanations

Xiaoyu Zhang, Yishan Li, Jiayin Wang, Bowen Sun, Weizhi Ma, Peijie Sun, Min Zhang
Tsinghua University
arXiv (2024)
Recommendation Benchmark P13N

📝 Paper Summary

Explainable Recommendation LLM-as-a-Judge
LLMs can replace expensive manual annotation for evaluating recommendation explanations by correlating strongly with user perception when using specific prompting strategies and ensemble methods.
Core Problem
Evaluating recommendation explanations is difficult because quality criteria (persuasiveness, transparency) are subjective to the user, making standard metrics ineffective and manual annotation unscalable.
Why it matters:
  • Traditional reference-based metrics (BLEU, ROUGE) measure textual similarity, which often correlates poorly with human perception of explanation quality
  • Self-reported user feedback is accurate but difficult and expensive to collect for large public datasets
  • Third-party manual annotation is costly, time-consuming, and lacks scalability
Concrete Example: A recommendation explanation might be textually very different from a reference review (leading to a low BLEU score) yet still be highly 'Persuasive' or 'Transparent' to the user. Existing metrics fail to capture these subjective dimensions.
Key Novelty
LLM-based Evaluation Framework for Explanations
  • Proposes using LLMs (e.g., GPT-4) as evaluators for subjective recommendation aspects (Persuasiveness, Transparency) via zero-shot and one-shot prompting
  • Introduces a 3-level meta-evaluation strategy (Dataset-Level, User-Level, Pair-Level) to rigorously measure how well evaluators correlate with ground-truth user perceptions
  • Explores 'Personalized One-Shot' prompting, where the LLM is given a scoring example from the specific user to learn their individual bias
Architecture
Architecture Figure Figure 1
A conceptual workflow illustrating the sources of evaluation data: Ground Truth (User), Third-party Annotators, Reference-based Metrics, and the proposed LLM Evaluator.
Evaluation Highlights
  • GPT-4 provides evaluations comparable to or better than traditional third-party human annotations
  • Reference-based metrics like BLEU-4 show poor or negative correlation with user ground truth at the Dataset-Level
  • Ensembling evaluations from multiple heterogeneous LLMs improves the stability and accuracy of the assessment
Breakthrough Assessment
7/10
Provides a rigorous methodological framework (3-level meta-evaluation) for a subjective task. While LLM-as-a-judge is known, applying it to personal recommendation explanations with grounded user-study data is a valuable contribution.
×