Evaluation Setup
Expert-based and automated evaluation of three LLM recommenders (GPT-4, LLaMA-3.1, P5) and baselines across three domains
Benchmarks:
- MovieLens-1M (Movie Recommendation)
- Amazon Books (Book Recommendation)
- Yelp (Restaurant Recommendation)
Metrics:
- Human-Centered Score (HCS)
- Gini Coefficient (Popularity Bias)
- Explanation Quality (Likert 5-point)
- Interaction Naturalness (Likert 5-point)
- NDCG@10 / Hit Rate@10 (Traditional Baselines)
- Statistical methodology: Inter-rater reliability using Fleiss' kappa and Intraclass Correlation Coefficient (ICC)
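As an illustration of the inter-rater reliability check, the sketch below computes Fleiss' kappa from raw expert ratings. It assumes every item is rated by the same number of experts. The function name `fleiss_kappa` and the example ratings are hypothetical and are not taken from the paper. Fleiss' kappa treats the Likert levels as unordered categories, which is one reason it is commonly reported alongside ICC.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each a list of category labels.

    Assumes every item was rated by the same number of raters (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who assigned item i to category j
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over items of the pairwise agreement rate
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical Likert scores: 4 explanations, each rated by 3 experts.
example_ratings = [
    [5, 5, 4],
    [4, 4, 4],
    [2, 3, 2],
    [5, 4, 4],
]
kappa = fleiss_kappa(example_ratings, categories=[1, 2, 3, 4, 5])
print(f"Fleiss' kappa: {kappa:.3f}")
```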
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Bias analysis reveals that advanced LLMs introduce significant popularity bias compared to traditional methods. | | | | |
| Cross-domain average | Gini Coefficient (lower is better/fairer) | 0.58 | 0.73 | +0.15 |
| Quality assessments by domain experts show GPT-4's strength in explanation and interaction. | | | | |
| Cross-domain average | Explanation Quality (1-5 scale) | Not reported in the paper | 4.21 | Not reported in the paper |
| Cross-domain average | Interaction Naturalness (1-5 scale) | Not reported in the paper | 4.35 | Not reported in the paper |
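The Gini coefficient in the table summarizes how unevenly recommendations are spread across the catalog: 0 means every item is recommended equally often, and values near 1 mean exposure is concentrated on a few popular items. This summary does not give the paper's exact computation, so the sketch below is one common formulation over per-item recommendation counts. The helper names `gini` and `exposure_gini` and the toy data are assumptions for illustration.

```python
from collections import Counter


def gini(values):
    """Gini coefficient of a list of non-negative numbers (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted form over values sorted in ascending order
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


def exposure_gini(recommendation_lists, catalog):
    """Gini over how often each catalog item appears in any recommendation list.

    Items that are never recommended count as zero exposure.
    """
    counts = Counter(item for recs in recommendation_lists for item in recs)
    return gini([counts.get(item, 0) for item in catalog])


# Hypothetical toy data: two users, a four-item catalog.
toy_recs = [["a", "b"], ["a", "c"]]
toy_catalog = ["a", "b", "c", "d"]
print(exposure_gini(toy_recs, toy_catalog))
```

Under this reading, the reported move from 0.58 to 0.73 means the LLM recommenders concentrate exposure on fewer items than the baselines do.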
Main Takeaways
- Traditional accuracy metrics (NDCG) fail to capture the user experience benefits of LLMs, such as explanation and naturalness
- There is a quantifiable trade-off between language capability and fairness: GPT-4 has the best natural language performance but the worst popularity bias (highest Gini)
- HELM effectively exposes quality dimensions invisible to traditional evaluation, such as the trust-building capacity of detailed explanations versus the efficiency of collaborative filtering
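To make the first takeaway concrete: NDCG@10 and Hit Rate@10 depend only on which items appear at which rank, so the wording of an explanation or the naturalness of a dialogue cannot change them. The sketch below assumes binary relevance (an item is either in the user's held-out set or not). The function names and signatures are illustrative and are not the paper's evaluation code.

```python
import math


def hit_rate_at_k(ranked_items, relevant_items, k=10):
    """1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant_items for item in ranked_items[:k]) else 0.0


def ndcg_at_k(ranked_items, relevant_items, k=10):
    """NDCG@k with binary relevance and a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked_items[:k], start=1)
        if item in relevant_items
    )
    # Ideal ranking places all relevant items (up to k) at the top
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: the single held-out item is ranked second.
print(ndcg_at_k(["b", "a", "c"], {"a"}), hit_rate_at_k(["b", "a", "c"], {"a"}))
```

Both functions take only item identifiers as input, which is why the Likert-scale quality dimensions have to be measured separately.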