On the Factual Consistency of Text-based Explainable Recommendation Models

📝 Paper Summary

Explainable Recommendation, Factuality Evaluation, Natural Language Generation

Current text-based explainable recommenders achieve high semantic similarity but fail at factual consistency, necessitating a new evaluation framework based on atomic statement extraction and verification.

Core Problem

State-of-the-art text-based explainable recommenders are evaluated on surface-level fluency (semantic similarity) rather than whether their explanations align with actual user preferences found in reviews.

Why it matters:

Existing metrics (BLEU, BERTScore) can be high even if the explanation hallucinates features or sentiments not present in the user's history
Explainable systems aim to build trust, but generating plausible yet factually incorrect justifications undermines system transparency and user confidence
Prior factuality metrics focus on coarse chunks or exact feature matching, missing fine-grained sentiment and topic alignment

Concrete Example: A model might generate a fluent explanation praising a camera's 'excellent low-light performance' (high BERTScore against a generic positive review) even if the specific user's actual review only discussed 'long battery life' and never mentioned low-light capabilities.

Key Novelty

Statement-Level Factuality Evaluation Framework

Constructs ground-truth explanations by using an LLM to extract atomic 'topic-sentiment' statements from user reviews, filtering out non-explanatory noise
Introduces fine-grained metrics that verify generated explanations against these atomic statements using both LLM-based verification and NLI (Natural Language Inference) entailment scoring

Evaluation Highlights

Models achieving high BERTScore F1 (0.81–0.90) exhibit alarmingly low factual precision (4.38%–32.88%)
Recall is consistently poor across all models, with the highest recall being only 29.86% (XRec on Beauty dataset)
NLI-based coherence metrics reveal that models frequently generate statements that directly contradict the ground truth (negative coherence scores)

Breakthrough Assessment

7/10

Exposes a critical flaw in current evaluation standards for explainable recommendation. The proposed pipeline and metrics provide a necessary correction, though the method relies heavily on LLMs, which may introduce their own biases.

⚙️ Technical Details

Problem Definition

Setting: Given a user-item interaction with a rating and review, predict the rating and generate a textual rationale that is factually consistent with the review's explanatory content.

Inputs: User u, Item i, Rating r_ui, Review t_ui (used for ground truth construction)

Outputs: Generated textual explanation e_ui

Pipeline Flow

Topic Definition (Domain-specific)
Statement Extraction (LLM-based)
Ground Truth Construction (Rule-based)
Explanation Generation (Various Baselines)
Factuality Evaluation (LLM & NLI)

System Modules

Statement Extractor (Ground Truth Generation)

Extracts atomic statements, topics, and sentiments from raw reviews

Model or implementation: Llama-3-8B-Instruct

Ground Truth Aggregator (Ground Truth Generation)

Synthesizes isolated statements into a coherent paragraph to serve as the factual reference

Model or implementation: Rule-based
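
The extraction output and the rule-based aggregation step can be pictured with a small sketch. The `Statement` fields mirror the (statement, topic, sentiment) triplets described in this summary. The filtering, de-duplication, and sentence-joining rules, the sentiment label set, and the name `build_ground_truth` are illustrative assumptions, not the paper's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One extracted triplet: atomic statement, its topic, and its sentiment."""
    text: str
    topic: str
    sentiment: str  # e.g. "positive" / "negative" / "neutral" (assumed label set)


def build_ground_truth(statements, allowed_topics):
    """Join on-topic, de-duplicated statements into one reference paragraph.

    Assumed rules (not taken from the paper): keep only statements whose topic
    is in the domain topic list, drop case-insensitive duplicates, preserve
    the original order, and end each statement with a period.
    """
    seen = set()
    sentences = []
    for st in statements:
        sentence = st.text.strip().rstrip(".").strip()
        key = sentence.lower()
        if st.topic not in allowed_topics or not sentence or key in seen:
            continue
        seen.add(key)
        sentences.append(sentence[0].upper() + sentence[1:] + ".")
    return " ".join(sentences)


# Hypothetical extractor output for a single review.
extracted = [
    Statement("battery is long-lasting", "battery", "positive"),
    Statement("Battery is long-lasting", "battery", "positive"),  # duplicate
    Statement("shipping was slow", "delivery", "negative"),       # off-topic
]
reference = build_ground_truth(extracted, allowed_topics={"battery", "camera"})
print(reference)  # Battery is long-lasting.
```

Keeping this step rule-based means the reference text contains nothing beyond what the extractor produced, so any error in the reference traces back to extraction rather than to a second generation pass.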

Evaluator (LLM-based) (Evaluation)

Assesses whether a statement is supported by the target explanation

Model or implementation: Llama-3.1-8B-Instruct

Evaluator (NLI-based) (Evaluation)

Computes entailment/contradiction probabilities between statement pairs

Model or implementation: DeBERTa-large-mnli
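
To make the NLI evaluator's output concrete, the sketch below turns three-way NLI logits for one statement pair into probabilities and a coherence score (entailment probability minus contradiction probability, as defined under Key Terms). The label order in `LABELS` and the logit values are assumptions for illustration; a real checkpoint's label mapping should be read from its own configuration, and how pairwise scores are aggregated into the reported metrics is not specified here.

```python
import math

# Assumed label order; check the model's own id-to-label mapping in practice.
LABELS = ("contradiction", "neutral", "entailment")


def nli_probabilities(logits):
    """Softmax over three-way NLI logits, returned as a label -> probability dict."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per NLI label")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(LABELS, exps)}


def coherence(logits):
    """Entailment probability minus contradiction probability, in [-1, 1]."""
    probs = nli_probabilities(logits)
    return probs["entailment"] - probs["contradiction"]


# Hypothetical logits for two statement pairs.
supportive = [-2.0, 0.5, 3.0]      # mass concentrated on entailment
contradictory = [3.0, 0.5, -2.0]   # mass concentrated on contradiction
print(coherence(supportive) > 0, coherence(contradictory) < 0)
```

A negative value means the NLI model assigns more probability to contradiction than to entailment, which is the failure mode that similarity metrics such as BERTScore cannot surface.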

Novel Architectural Elements

Statement-level ground truth construction: Unlike previous methods that use raw reviews or summaries, this creates a synthetic reference containing ONLY verifiable explanatory statements.

Modeling

Base Model: Evaluated baselines: NRT, Att2Seq, PETER, CER, PEPLER, XRec (Llama-2-7b). Evaluation models: Llama-3.1-8B-Instruct and DeBERTa-large-mnli.

Compute: Not reported in the paper

Comparison to Prior Work

vs. BERTScore/BLEURT: Focuses on verifiable atomic facts rather than broad semantic overlap
vs. DIV/FCR [cited in paper]: Assesses sentiment and reasoning, not just feature mention matching
vs. FactScore [not cited in paper]: Adapts atomic fact decomposition to recommendation domains, using (statement, topic, sentiment) triplets, rather than general biography generation

Limitations

Evaluation relies on LLMs (Llama-3), which may introduce their own hallucinations or biases during statement extraction and scoring
Ground truth construction is synthetic and rule-based, potentially losing nuances of human writing style
NLI metrics are asymmetric and computationally expensive for pairwise comparisons
Requires domain-specific topic lists to be defined beforehand

Reproducibility

Code: https://github.com/BenKabongo25/factual_explainable_recommendation

Datasets and illustrations are hosted in the same repository. Experiments use the public Amazon Reviews dataset. Prompts for extraction are described in the paper.

📊 Experiments & Results

Evaluation Setup

Explainable recommendation on Amazon Reviews categories. Models generate explanations which are compared against constructed statement-level ground truths.

Benchmarks:

Amazon Reviews (Toys, Clothes, Beauty, Sports, Cellphones) (Explanation Generation)

Metrics:

Statement-to-Explanation Precision (St2Exp-P)
Statement-to-Explanation Recall (St2Exp-R)
Statement-to-Explanation F1 (St2Exp-F1)
Statement Entailment Precision (StEnt-P)
Statement Coherence (StCoh)
BERTScore F1
BLEURT
Statistical methodology: Standard deviations reported across samples.
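
The statement-to-explanation precision, recall, and F1 can be sketched with a pluggable verifier. Here `is_supported(statement, explanation)` stands in for the LLM-based evaluator; the keyword-overlap `toy_verifier` is a hypothetical placeholder, not the paper's prompt or model. The sketch assumes the usual convention: precision checks statements from the generated explanation against the reference, and recall checks reference statements against the generated text. Scores are computed for a single example; how the paper aggregates across samples is not assumed.

```python
def st2exp_scores(generated_statements, reference_statements,
                  generated_text, reference_text, is_supported):
    """Return (precision, recall, f1) as percentages for one example.

    precision: share of generated statements supported by the reference text.
    recall: share of reference statements supported by the generated text.
    """
    def share(statements, explanation):
        if not statements:
            return 0.0
        hits = sum(1 for s in statements if is_supported(s, explanation))
        return 100.0 * hits / len(statements)

    precision = share(generated_statements, reference_text)
    recall = share(reference_statements, generated_text)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def toy_verifier(statement, explanation):
    """Placeholder check: every word of the statement appears in the explanation."""
    words = set(explanation.lower().replace(".", " ").split())
    return all(w in words for w in statement.lower().split())


# Hypothetical example: one shared claim, one hallucinated, one missed.
gen_text = "The battery is great. The camera is excellent in low light."
ref_text = "The battery is great. The strap is flimsy."
gen_statements = ["battery is great", "camera is excellent"]
ref_statements = ["battery is great", "strap is flimsy"]

p, r, f1 = st2exp_scores(gen_statements, ref_statements,
                         gen_text, ref_text, toy_verifier)
print(p, r, f1)  # 50.0 50.0 50.0
```

Passing the verifier as a function keeps the metric arithmetic separate from the judgment of support, so an LLM call or an NLI threshold could be swapped in without changing the scoring code.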

Key Results

High semantic similarity scores across models contrast sharply with low factual consistency scores.
Benchmark	Metric	Baseline	This Paper	Δ
Amazon Sports	BERTScore F1	Not reported in the paper	0.90	Not reported in the paper
Amazon Sports	St2Exp-P	5.38	32.88	+27.50
Amazon Beauty	St2Exp-R	13.08	29.86	+16.78
Amazon Toys	St2Exp-P	16.63	25.27	+8.64
Amazon Cellphones	StEnt-P	20.77	24.80	+4.03
Amazon Clothes	StCoh-P	-1.04	3.31	+4.35

Experiment Figures

Radar chart comparing BERTScore F1 vs Statement-level Precision (St2Exp-P) across 6 models on the Amazon Sports dataset.

Main Takeaways

Massive gap between semantic similarity (BERTScore ~0.9) and factual accuracy (<33%), indicating current models hallucinate explanations that look good but are factually unsupported.
LLM-based models (XRec) generally outperform RNN/Transformer baselines in factuality, but still fail to recover the majority of ground truth statements (Recall < 30%).
NLI metrics show that models often generate contradictory statements (negative coherence scores), a failure mode not captured by standard similarity metrics.
Transformer-based models (PEPLER) excel at mimicking review style (high BLEURT) but do not necessarily ensure factual consistency.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with recommender systems and explainable recommendation
Understanding of Natural Language Generation (NLG) metrics (BLEU, BERTScore)
Basic knowledge of Natural Language Inference (NLI)

Key Terms

Atomic statement: A polarized fact expressing a user's opinion about a single attribute or topic of an item (e.g., 'battery is long-lasting')

NLI: Natural Language Inference—determining if a hypothesis is entailed by, contradicts, or is neutral to a premise

BERTScore: A metric evaluating text generation quality by computing similarity between contextual embeddings of candidate and reference texts

Hallucination: Generated content that is grammatically fluent but factually incorrect or unsupported by the input source

Triplets: Structured representations consisting of (statement, topic, sentiment) extracted from reviews

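St2Exp-P: Statement-to-Explanation Precision, which measures the proportion of statements extracted from the generated explanation that are supported by the ground-truth explanation (recall, St2Exp-R, is the reverse: the proportion of ground-truth statements supported by the generated explanation)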

StEnt-P: Statement Entailment Precision—an NLI-based metric calculating the entailment probability between generated statements and ground truth statements

Coherence score: The difference between entailment probability and contradiction probability in NLI evaluation