HELM: A Human-Centered Evaluation Framework for LLM-Powered Recommender Systems

Sushant Mehta
arXiv (2026)
Recommendation Benchmark P13N Factuality

📝 Paper Summary

Evaluation Methodologies Trustworthy Recommender Systems
HELM is a multidimensional evaluation framework that assesses LLM-powered recommenders on human-centered qualities like trust and fairness, revealing that superior language capabilities often correlate with increased popularity bias.
Core Problem
Current evaluations of LLM-powered recommenders rely on traditional accuracy metrics (like Hit Rate and NDCG) that fail to capture critical human-centered qualities such as explainability, trust, and fairness.
Why it matters:
  • Traditional metrics favor systems that recommend popular items over those that build user trust through transparent reasoning
  • LLMs introduce unique risks like hallucination and conversational biases that accuracy metrics cannot detect
  • There is no comprehensive framework to evaluate the trade-offs between natural language capabilities and ethical dimensions in recommendation
Concrete Example: A traditional collaborative filtering system might score high on accuracy by recommending a popular blockbuster, while an LLM recommender might suggest a niche independent film with a personalized explanation that matches the user's mood. Traditional metrics penalize the latter even though it may offer a better, more trustworthy user experience.
Key Novelty
HELM (Human-centered Evaluation for LLM-powered recoMmenders)
  • Establishes five specific evaluation dimensions (Intent, Explanation, Interaction, Trust, Fairness) tailored for generative recommendation systems
  • Combines rigorous expert evaluation of natural language dialogues with automated proxy metrics (like Gini coefficients and faithfulness checks) to capture qualitative trade-offs
  • Uses a geometric mean aggregation to prevent high performance in one area (e.g., fluency) from masking failure in another (e.g., fairness)
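The masking effect behind the geometric-mean choice can be illustrated with a small sketch. It assumes equal weights and per-dimension scores on a 1 to 5 scale; the summary does not state the paper's exact weighting or normalization, and the score values below are hypothetical, chosen only to show the effect.

```python
import math

# The five HELM dimensions named in the summary.
DIMENSIONS = ("intent", "explanation", "interaction", "trust", "fairness")


def arithmetic_mean(scores):
    """Plain average: a weak dimension can be offset by strong ones."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def geometric_mean(scores):
    """Equal-weight geometric mean (an assumption, not the paper's stated formula).

    Computed in log space; requires strictly positive scores.
    """
    values = [scores[d] for d in DIMENSIONS]
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive scores")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical scores for illustration only (not results from the paper).
balanced = {d: 3.0 for d in DIMENSIONS}
lopsided = {"intent": 4.5, "explanation": 4.5, "interaction": 4.5,
            "trust": 4.5, "fairness": 1.0}

for name, scores in (("balanced", balanced), ("lopsided", lopsided)):
    print(name, round(arithmetic_mean(scores), 3), round(geometric_mean(scores), 3))
```

For the lopsided profile the arithmetic mean stays well above the balanced one, while the geometric mean is pulled down toward the failing fairness score, which is the behavior the aggregation is meant to provide.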
Evaluation Highlights
  • GPT-4 exhibits significantly higher popularity bias (Gini coefficient 0.73) than traditional Neural Collaborative Filtering (0.58), which suggests a trade-off between language capability and fairness
  • GPT-4 achieves high marks for Explanation Quality (4.21/5.0) and Interaction Naturalness (4.35/5.0) according to domain experts
  • The framework identifies that stronger language understanding in LLMs correlates with increased popularity bias across movie, book, and restaurant domains
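As a rough illustration of how a Gini coefficient can quantify popularity bias, the sketch below applies the standard Gini formula to per-item exposure counts, where 0 means every catalog item is recommended equally often and values near 1 mean exposure is concentrated on a few items. The summary does not specify how the paper computes its Gini values, so the exposure definition and the helper names `gini` and `exposure_gini` are assumptions, and the toy data is hypothetical.

```python
from collections import Counter


def gini(counts):
    """Standard Gini coefficient of a list of non-negative counts."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # no items or no exposure: treat as perfectly equal
    # Rank-weighted sum over ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def exposure_gini(recommendation_lists, catalog):
    """Gini of how often each catalog item appears across recommendation lists.

    Items that are never recommended count as zero exposure.
    """
    exposure = Counter(item for recs in recommendation_lists for item in recs)
    return gini([exposure.get(item, 0) for item in catalog])


# Hypothetical toy data: three users, a four-item catalog.
catalog = ["a", "b", "c", "d"]
recs = [["a", "b"], ["a", "c"], ["a", "b"]]
print(round(exposure_gini(recs, catalog), 3))
```

Under this definition, a system that keeps recommending the same popular items produces a higher value than one that spreads exposure across the catalog, which is the direction of the 0.73 versus 0.58 comparison reported above.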
Breakthrough Assessment
8/10
Addresses a critical gap in evaluating generative recommenders by moving beyond accuracy. The finding linking language capability to popularity bias is a significant insight for the field.