
On the Factual Consistency of Text-based Explainable Recommendation Models

Ben Kabongo, Vincent Guigue
AgroParisTech, Sorbonne Université
arXiv (2025)
Recommendation Factuality Benchmark P13N

📝 Paper Summary

Explainable Recommendation, Factuality Evaluation, Natural Language Generation
Current text-based explainable recommenders achieve high semantic similarity but fail at factual consistency, necessitating a new evaluation framework based on atomic statement extraction and verification.
Core Problem
State-of-the-art text-based explainable recommenders are evaluated on surface-level text similarity to reference explanations, not on whether their explanations agree with the user preferences actually expressed in reviews.
Why it matters:
  • Existing metrics (BLEU, BERTScore) can be high even if the explanation hallucinates features or sentiments not present in the user's history
  • Explainable systems aim to build trust, but generating plausible yet factually incorrect justifications undermines system transparency and user confidence
  • Prior factuality metrics focus on coarse chunks or exact feature matching, missing fine-grained sentiment and topic alignment
Concrete Example: A model might generate a fluent explanation praising a camera's 'excellent low-light performance' (high BERTScore against a generic positive review) even if the specific user's actual review only discussed 'long battery life' and never mentioned low-light capabilities.
Key Novelty
Statement-Level Factuality Evaluation Framework
  • Constructs ground-truth explanations by using an LLM to extract atomic 'topic-sentiment' statements from user reviews, filtering out non-explanatory noise
  • Introduces fine-grained metrics that verify generated explanations against these atomic statements using both LLM-based verification and NLI (Natural Language Inference) entailment scoring
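The statement-level verification described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `supports` callback, and the exact precision and recall definitions (share of generated statements supported by the ground truth, and share of ground-truth statements supported by the generated explanation) are assumptions made here. The toy verifier uses normalized exact matching as a stand-in for the LLM-based or NLI-based check.

```python
from typing import Callable, List, Tuple

# Hypothetical verifier signature: True if `statement` is supported by `references`.
Verifier = Callable[[str, List[str]], bool]


def statement_precision_recall(
    generated: List[str],
    ground_truth: List[str],
    supports: Verifier,
) -> Tuple[float, float]:
    """Assumed definitions: precision is the share of generated statements
    supported by the ground truth; recall is the share of ground-truth
    statements supported by the generated statements."""
    precision = (
        sum(supports(s, ground_truth) for s in generated) / len(generated)
        if generated else 0.0
    )
    recall = (
        sum(supports(s, generated) for s in ground_truth) / len(ground_truth)
        if ground_truth else 0.0
    )
    return precision, recall


def toy_supports(statement: str, references: List[str]) -> bool:
    # Stand-in for an LLM or NLI verifier: exact match after normalization.
    norm = statement.strip().lower()
    return any(norm == r.strip().lower() for r in references)


# Atomic statements in the spirit of the camera example.
ground_truth = ["battery life is long"]
generated = ["low-light performance is excellent", "battery life is long"]

precision, recall = statement_precision_recall(generated, ground_truth, toy_supports)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one of the two generated statements has no support in the user's review, so precision drops even though the explanation reads fluently.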
Evaluation Highlights
  • Models achieving high BERTScore F1 (0.81–0.90) exhibit alarmingly low factual precision (4.38%–32.88%)
  • Recall is consistently poor across all models: the highest value is only 29.86% (XRec on the Beauty dataset)
  • NLI-based coherence metrics reveal that models frequently generate statements that directly contradict the ground truth (negative coherence scores)
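To show how an NLI-based score can become negative, here is a small sketch. The summary does not give the coherence formula, so the definition below (entailment probability minus contradiction probability, averaged over statement pairs) is an assumption chosen for illustration. The `NLIProbs` class and the probability values are hypothetical; in practice they would come from an NLI model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class NLIProbs:
    """Hypothetical container for NLI class probabilities of one
    (ground-truth statement, generated statement) pair."""
    entailment: float
    neutral: float
    contradiction: float


def coherence(probs: NLIProbs) -> float:
    # Assumed definition: ranges over [-1, 1]; negative when
    # contradiction outweighs entailment.
    return probs.entailment - probs.contradiction


def mean_coherence(all_probs: List[NLIProbs]) -> float:
    # Average the per-pair scores; an empty input is scored as 0.0.
    return mean(coherence(p) for p in all_probs) if all_probs else 0.0


# Made-up probabilities: one entailed pair, two mostly contradicted pairs.
example = [
    NLIProbs(entailment=0.75, neutral=0.25, contradiction=0.0),
    NLIProbs(entailment=0.0, neutral=0.25, contradiction=0.75),
    NLIProbs(entailment=0.0, neutral=0.5, contradiction=0.5),
]

score = mean_coherence(example)
print(f"mean coherence = {score:.3f}")
```

Under this assumed definition, a negative average means the generated statements contradict the ground truth more strongly than they are entailed by it.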
Breakthrough Assessment
7/10
Exposes a critical flaw in current evaluation standards for explainable recommendation. The proposed pipeline and metrics provide a necessary correction, though the method relies heavily on LLMs for statement extraction and verification, which may introduce their own biases.