
Behavior Alignment: A New Perspective of Evaluating LLM-based Conversational Recommender Systems

Dayu Yang, Fumian Chen, Hui Fang
University of Delaware
arXiv (2024)
Recommendation Benchmark P13N

📝 Paper Summary

Conversational Recommender Systems (CRS) LLM Evaluation Human-AI Alignment
Behavior Alignment is a new metric for Conversational Recommender Systems that measures how closely an LLM's recommendation strategies match human strategies, revealing that current LLMs are often too passive.
Core Problem
LLM-based Conversational Recommender Systems often fail to proactively ask about user preferences and rush to recommend items, whereas human recommenders use complex information-seeking strategies.
Why it matters:
  • Current metrics (BLEU, Perplexity) measure text fluency but fail to capture the strategic behavior (e.g., inquiry vs. recommendation) crucial for effective recommendation.
  • LLMs' passive behavior leads to insufficient user preference data, resulting in lower recommendation accuracy and user satisfaction compared to human recommenders.
Concrete Example: In the INSPIRED dataset, human recommenders typically converse for about 2.5 turns before making a recommendation, using those turns to gather information. In contrast, GPT-3.5 and Llama-2 often recommend immediately without asking clarifying questions, which leads to poor suggestions.
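A minimal sketch of how the "turns before recommending" statistic could be computed, assuming each dialogue has already been reduced to a list of per-turn strategy labels for the recommender. The label names, the helper functions, and the toy dialogues are illustrative assumptions, not data or code from the paper.

```python
from statistics import mean

def turns_before_first_recommendation(strategies, rec_label="recommendation"):
    """Count recommender turns that come before the first recommendation.

    `strategies` is a list of per-turn strategy labels (hypothetical names).
    If the dialogue contains no recommendation, every turn is counted.
    """
    for index, strategy in enumerate(strategies):
        if strategy == rec_label:
            return index
    return len(strategies)

def mean_turns_before_recommendation(dialogues):
    """Average the per-dialogue count over a collection of dialogues."""
    return mean(turns_before_first_recommendation(d) for d in dialogues)

# Toy dialogues, invented for illustration only.
human_like = [
    ["inquiry", "inquiry", "recommendation"],
    ["inquiry", "recommendation"],
]
llm_like = [
    ["recommendation", "recommendation"],
    ["inquiry", "recommendation"],
]

print(mean_turns_before_recommendation(human_like))
print(mean_turns_before_recommendation(llm_like))
```

A larger gap between the two averages would indicate that the model commits to a recommendation earlier than the human recommenders do.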
Key Novelty
Behavior Alignment Metric & Implicit Estimation
  • Explicitly compares the distribution of 'recommendation strategies' (e.g., inquiry, encouragement, offer help) used by an LLM against those used by humans in the same context.
  • Introduces a classification-based method that estimates this alignment implicitly, without costly human annotation, by training a classifier to predict whether a model response and a human response share the same strategy.
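The two ideas above can be sketched roughly as follows. This summary does not give the paper's exact formula, so the sketch assumes the explicit score is the fraction of contexts in which the LLM's strategy label matches the human's, and the implicit score is the average verdict of a binary "same strategy?" classifier. All function names are hypothetical, and `toy_same_strategy` is a crude placeholder standing in for a trained classifier.

```python
from collections import Counter

def strategy_distribution(strategies):
    """Relative frequency of each strategy label."""
    counts = Counter(strategies)
    total = len(strategies)
    return {label: count / total for label, count in counts.items()}

def explicit_alignment(llm_strategies, human_strategies):
    """Assumed explicit score: share of contexts with matching strategy labels."""
    if len(llm_strategies) != len(human_strategies) or not human_strategies:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(l == h for l, h in zip(llm_strategies, human_strategies))
    return matches / len(human_strategies)

def implicit_alignment(response_pairs, same_strategy):
    """Assumed implicit score: mean verdict of a binary classifier.

    `same_strategy(llm_response, human_response)` should return True when
    the two responses are judged to use the same strategy.
    """
    verdicts = [bool(same_strategy(llm, human)) for llm, human in response_pairs]
    return sum(verdicts) / len(verdicts)

def toy_same_strategy(llm_response, human_response):
    # Placeholder only: treats "is it a question?" as the strategy.
    return llm_response.strip().endswith("?") == human_response.strip().endswith("?")

# Toy labels and responses, invented for illustration.
human = ["inquiry", "inquiry", "encouragement", "recommendation"]
llm = ["recommendation", "inquiry", "recommendation", "recommendation"]
pairs = [
    ("What genres do you like?", "Do you prefer comedies?"),
    ("You should watch this one.", "What mood are you in?"),
]

print(strategy_distribution(llm))
print(explicit_alignment(llm, human))
print(implicit_alignment(pairs, toy_same_strategy))
```

The implicit variant needs no strategy labels at evaluation time, which is the point of replacing annotation with a classifier; its quality depends entirely on how well that classifier was trained.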
Evaluation Highlights
  • Behavior Alignment agrees with human preference at a Cohen's Kappa of 0.74, far higher than BLEU and DIST, which show minimal agreement.
  • The implicit classifier, trained with 'hard negatives', achieves over 93% accuracy in predicting strategy alignment on out-of-distribution data (ReDial dataset).
  • Human recommenders wait ~2.5 turns before recommending, whereas LLMs often recommend immediately; Behavior Alignment successfully quantifies this passivity.
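For readers unfamiliar with Cohen's Kappa, the statistic corrects raw agreement for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the chance agreement rate. The sketch below assumes the comparison is made over categorical judgments, for example which of two responses a human prefers versus which one the metric prefers; that setup and the toy labels are assumptions for illustration, not the paper's data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equally long lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((counts_a[k] / n) * (counts_b[k] / n) for k in counts_a)
    if p_e == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1.0 - p_e)

# Toy preferences, invented for illustration.
human_preference = ["A", "A", "B", "B"]
metric_preference = ["A", "B", "B", "B"]
print(cohens_kappa(human_preference, metric_preference))
```

A value near 0 means agreement no better than chance, and a value near 1 means near-perfect agreement.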
Breakthrough Assessment
7/10
Addresses a critical blind spot in CRS evaluation (behavior/strategy vs. just text quality). The proposed metric correlates far better with human judgment than standard NLP metrics, though it relies on specific strategy taxonomies.