RobustExplain: Evaluating Robustness of LLM-Based Explanation Agents for Recommendation

📝 Paper Summary

Explainable Recommendation, LLM Robustness, Trustworthy AI

RobustExplain is a systematic evaluation framework that measures how LLM-generated recommendation explanations change when user interaction histories are subjected to realistic noise such as accidental clicks or missing metadata.

Core Problem

LLM-based explanation agents often generate inconsistent or unstable rationales when user interaction histories contain noise, undermining user trust even if the recommendation remains valid.

Why it matters:

Real-world interaction data is inherently noisy due to accidental clicks, shared accounts, and evolving preferences, unlike the clean data assumed in standard evaluations
Inconsistent explanations (e.g., changing reasoning based on a single accidental click) can erode user trust in recommender systems
Prior work focuses on fluency and relevance under static inputs, ignoring the critical dimension of stability under perturbation

Concrete Example: A user has a long history of buying sci-fi books. If a single 'accidental click' on a cooking pot is injected into their history (Noise Injection), an unstable explanation agent might suddenly justify a sci-fi recommendation by referencing 'kitchen preferences' or drastically change its reasoning structure, rather than robustly adhering to the dominant sci-fi pattern.

Key Novelty

Systematic Perturbation-Based Robustness Evaluation Framework

Defines a taxonomy of five realistic user behavior perturbations (e.g., noise injection, temporal shuffle) mapped to severity levels to simulate real-world data quality issues
Introduces a multi-dimensional metric combining semantic consistency, keyword stability, structural preservation, and length variation to quantify how explanations degrade under noise

Architecture

Figure: The RobustExplain evaluation framework workflow.

Evaluation Highlights

Current LLM explanation agents (7B–70B) show only moderate robustness, with average consistency scores around 0.50, indicating high sensitivity to noise
Larger models (e.g., LLaMA-3.1-70B) demonstrate up to about 8% higher stability than smaller counterparts like Qwen2.5-7B
Different perturbation types trigger distinct failure modes; models are more sensitive to noise injection than to the other perturbation types

Breakthrough Assessment

7/10

Establishes the first principled benchmark for explanation robustness in recommenders. While it doesn't propose a new model architecture to fix the problem, it exposes a critical flaw in current systems and provides the tools to measure it.

⚙️ Technical Details

Problem Definition

Setting: Generating natural language explanations for item recommendations based on user interaction history

Inputs: User interaction history H_u (items, timestamps, categories), recommended item r, and item feature matrix X

Outputs: Natural language explanation e justifying why r is recommended

Pipeline Flow

Input: Original History H_u → Generator E → Original Explanation e
Perturbation: H_u → Perturbation Function δ → Perturbed History H'_u
Input: Perturbed History H'_u → Generator E → Perturbed Explanation e'
Evaluation: Compare (e, e') using Multi-Dimensional Metrics
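
The four steps above can be sketched as a single function. This is a minimal illustration, not the authors' implementation: `run_robustness_trial`, `toy_generator`, and `toy_exact_match` are hypothetical names, the history is simplified to (item, category) pairs, and the toy generator stands in for an LLM so the sketch runs offline.

```python
from typing import Callable, List, Tuple

# Simplified stand-in for H_u: (item_id, category) pairs in chronological order.
History = List[Tuple[str, str]]


def run_robustness_trial(
    history: History,
    recommended_item: str,
    generator: Callable[[History, str], str],
    perturb: Callable[[History], History],
    compare: Callable[[str, str], float],
) -> float:
    """Generate e from H_u and e' from H'_u = delta(H_u), then score the pair."""
    original_explanation = generator(history, recommended_item)
    perturbed_history = perturb(history)
    perturbed_explanation = generator(perturbed_history, recommended_item)
    return compare(original_explanation, perturbed_explanation)


# Hypothetical toy components so the sketch runs without an LLM.
def toy_generator(history: History, item: str) -> str:
    categories = [category for _, category in history]
    # sorted() makes tie-breaking deterministic.
    top = max(sorted(set(categories)), key=categories.count)
    return f"Recommended {item} because you often browse {top}."


def toy_exact_match(e: str, e_prime: str) -> float:
    return 1.0 if e == e_prime else 0.0
```

Passing the generator, perturbation, and comparison as arguments keeps the loop independent of any particular model or metric, which matches the modular layout described under System Modules.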

System Modules

Perturbation Module

Apply specific noise types (Noise Injection, Temporal Shuffle, etc.) at varying severity levels (1-5) to user history

Model or implementation: Rule-based functions (e.g., random sampling, shuffling)
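
As a rough sketch of what such rule-based functions might look like, the code below implements noise injection and temporal shuffle. The summary does not say how severity levels translate into perturbation strength, so `severity_to_fraction` (level 1 touches 10% of the history, level 5 touches 50%) is an assumption made for illustration, as are the function names and the (item, category) history format.

```python
import random
from typing import List, Tuple

# (item_id, category); list order is assumed to be chronological order.
Interaction = Tuple[str, str]


def severity_to_fraction(severity: int) -> float:
    """Hypothetical mapping, not from the paper: level 1 -> 10%, level 5 -> 50%."""
    if not 1 <= severity <= 5:
        raise ValueError("severity must be between 1 and 5")
    return severity / 10


def noise_injection(
    history: List[Interaction],
    catalog: List[Interaction],
    severity: int,
    rng: random.Random,
) -> List[Interaction]:
    """Insert randomly sampled catalog items at random positions (accidental clicks)."""
    perturbed = list(history)  # copy so the original history is untouched
    n_noise = max(1, round(len(history) * severity_to_fraction(severity)))
    for _ in range(n_noise):
        position = rng.randrange(len(perturbed) + 1)
        perturbed.insert(position, rng.choice(catalog))
    return perturbed


def temporal_shuffle(
    history: List[Interaction], severity: int, rng: random.Random
) -> List[Interaction]:
    """Permute the order of a severity-dependent subset of interactions."""
    perturbed = list(history)
    n_moved = min(
        len(history),
        max(2, round(len(history) * severity_to_fraction(severity))),
    )
    positions = rng.sample(range(len(history)), n_moved)
    moved = [perturbed[p] for p in positions]
    rng.shuffle(moved)
    for position, interaction in zip(positions, moved):
        perturbed[position] = interaction
    return perturbed
```

Taking an explicit `random.Random` instance rather than using the global generator makes each perturbed history reproducible from a seed.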

Explanation Generator

Generate natural language justifications for recommendations

Model or implementation: Various LLMs (Qwen2.5-7B, LLaMA-3.1-8B, Qwen2.5-14B, LLaMA-3.1-70B)

Evaluator

Compute robustness metrics between original and perturbed explanations

Model or implementation: Metric formulas (Cosine Similarity, Jaccard, BLEU)

Novel Architectural Elements

Perturbation Taxonomy: A specific set of 5 noise types tailored to recommender system user histories (not generic text noise)
Multi-Dimensional Robustness Metric: A composite score specifically designed for recommendation explanations, weighting semantic and keyword stability over structure
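
One way to picture the composite score is as a weighted average of the four component metrics. The weights below are hypothetical: the summary says only that semantic and keyword stability are weighted above structure, so the specific values (0.4, 0.3, 0.2, 0.1) and the names `RobustnessWeights` and `weighted_robustness` are assumptions chosen to respect that ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RobustnessWeights:
    # Hypothetical values, not taken from the paper. They only preserve the
    # stated ordering: semantic and keyword stability outweigh structure.
    semantic: float = 0.4
    keyword: float = 0.3
    structural: float = 0.2
    length: float = 0.1


def weighted_robustness(
    sem: float,
    key: float,
    struct: float,
    length: float,
    weights: RobustnessWeights = RobustnessWeights(),
) -> float:
    """Weighted average of the four component scores, each assumed to lie in [0, 1]."""
    total = weights.semantic + weights.keyword + weights.structural + weights.length
    weighted_sum = (
        weights.semantic * sem
        + weights.keyword * key
        + weights.structural * struct
        + weights.length * length
    )
    return weighted_sum / total
```

Dividing by the weight total keeps the result in [0, 1] even if the weights are changed and no longer sum to one.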

Modeling

Base Model: Evaluated on Qwen2.5-7B, LLaMA-3.1-8B, Qwen2.5-14B, LLaMA-3.1-70B

Training Method: Zero-shot prompting (Inference-only evaluation)

Compute: Models deployed locally via Ollama; specific GPU hardware not reported in the paper
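
For readers who want to reproduce the zero-shot setup, the sketch below shows one way to query a locally running Ollama server over its REST API using only the standard library. The prompt wording, the function names, and the model tag `qwen2.5:7b` are assumptions for illustration; the paper's actual prompt is not given in this summary.

```python
import json
import urllib.request
from typing import List, Tuple

# Ollama's default local endpoint for single-turn generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(history: List[Tuple[str, str]], recommended_item: str) -> str:
    """Hypothetical zero-shot prompt; the paper's exact wording is not known here."""
    lines = [f"- {item} ({category})" for item, category in history]
    return (
        "A user interacted with these items, oldest first:\n"
        + "\n".join(lines)
        + f"\n\nExplain in 2-3 sentences why '{recommended_item}' is recommended."
    )


def generate_explanation(
    history: List[Tuple[str, str]],
    recommended_item: str,
    model: str = "qwen2.5:7b",  # assumed Ollama model tag
    timeout: float = 120.0,
) -> str:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps(
        {
            "model": model,
            "prompt": build_prompt(history, recommended_item),
            "stream": False,
            # Temperature 0 reduces sampling variance between e and e'.
            "options": {"temperature": 0},
        }
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"].strip()
```

Setting the temperature to 0 is a design choice worth considering in any replication: otherwise differences between the original and perturbed explanations mix perturbation effects with ordinary sampling variance.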

Comparison to Prior Work

vs. Standard Explanation Evaluation: RobustExplain measures stability under input perturbation rather than static quality
vs. NLP Robustness: Perturbations are semantically meaningful to user behavior (e.g., temporal shuffle) rather than surface-level text noise
vs. Recommendation Robustness: Focuses on the natural language explanation output rather than the item ranking output

Limitations

Evaluation relies on synthetic data, which may not fully capture the complexity of large-scale production logs
Scope limited to four specific LLMs; results may vary for closed-source models like GPT-4
Focuses on stability metrics but does not measure the 'correctness' of the explanation relative to ground truth user intent (only relative to the original explanation)
Does not propose a method to *improve* robustness, only a framework to *evaluate* it

Reproducibility

Code: https://github.com/GuilinDev/LLM-Robustness-Explain

The code is publicly available at the repository above. The dataset is a controlled synthetic e-commerce dataset (200 items, 7 categories, 100 users) designed for reproducibility. The models are standard open-weight LLMs (LLaMA, Qwen) run via Ollama.

📊 Experiments & Results

Evaluation Setup

Controlled experiments on synthetic e-commerce dataset (200 items, 7 categories, 100 users)
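
A dataset of this shape can be approximated in a few lines. The sketch below matches only the reported sizes (200 items, 7 categories, 100 users); the category names, the history length of 20, and the 70% preference for one dominant category per user are assumptions, since the summary does not describe how the authors generated their data.

```python
import random
from typing import Dict, List, Tuple

Item = Tuple[str, str]  # (item_id, category)


def make_synthetic_dataset(
    n_items: int = 200,
    n_categories: int = 7,
    n_users: int = 100,
    history_len: int = 20,       # assumed; not reported in the summary
    preference_strength: float = 0.7,  # assumed share of in-category interactions
    seed: int = 0,
) -> Tuple[List[Item], Dict[str, List[Item]]]:
    """Build a toy catalog and per-user histories with one dominant category each."""
    rng = random.Random(seed)
    items = [(f"item_{i:03d}", f"category_{i % n_categories}") for i in range(n_items)]

    by_category: Dict[str, List[Item]] = {}
    for item in items:
        by_category.setdefault(item[1], []).append(item)

    histories: Dict[str, List[Item]] = {}
    for u in range(n_users):
        preferred = rng.choice(sorted(by_category))
        histories[f"user_{u:03d}"] = [
            rng.choice(by_category[preferred])
            if rng.random() < preference_strength
            else rng.choice(items)
            for _ in range(history_len)
        ]
    return items, histories
```

Giving each user a dominant category makes the clean history carry a clear signal, so any change in the explanation after perturbation is easier to attribute to the injected noise.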

Benchmarks:

RobustExplain Framework (Explanation Generation under Perturbation) [New]

Metrics:

Semantic Similarity (Sem)
Keyword Stability (Key)
Structural Consistency (Struct)
Length Stability (Len)
Weighted Robustness Score
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
RobustExplain (Average Consistency)	Robustness Score	0.50	0.54	+0.04
RobustExplain	Stability Gain	Not reported in the paper	Not reported in the paper	+0.08

Main Takeaways

Current LLMs exhibit only moderate robustness (scores ~0.50), meaning explanations change significantly even with minor user history noise
There is a positive correlation between model size and robustness; 70B models are more stable than 7B models
Models are sensitive to specific types of noise: 'Noise Injection' (random items) tends to disrupt explanations more than 'Temporal Shuffle'
Metrics are complementary: Semantic similarity captures meaning, while keyword stability captures specific entity preservation

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Recommender Systems (user history, item features)
Familiarity with Large Language Models (LLMs) for text generation
Concepts of robustness and perturbation in machine learning

Key Terms

Noise Injection: Adding random interactions to a user history to simulate accidental clicks or exploratory browsing

Temporal Shuffle: Randomly permuting the order of interactions to simulate timestamp inaccuracies or delayed logging

Behavior Dilution: Injecting interactions from a user's least-engaged categories to simulate shared accounts or gift purchases

Category Drift: Replacing a fraction of interactions with items from different categories to simulate evolving user interests

Semantic Similarity (Sem): A metric measuring meaning preservation between original and perturbed explanations using bag-of-words cosine similarity

Keyword Stability (Key): Jaccard coefficient of key terms (nouns, product names) extracted from original and perturbed explanations

Structural Consistency (Struct): BLEU score measuring the preservation of explanation structure and phrasing patterns

Length Stability (Len): A measure of relative length preservation to detect dramatic changes in explanation verbosity
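
Three of the four metrics defined above can be sketched with the standard library alone. This is an approximation under stated assumptions: key terms are taken to be all non-stopword tokens rather than extracted nouns and product names, the stopword list is a small illustrative set, and length stability is assumed to be the ratio of the shorter to the longer token count because the summary gives no formula. Structural Consistency is omitted here; a BLEU implementation is available in libraries such as NLTK.

```python
import math
import re
from collections import Counter
from typing import List

# Small illustrative stopword list (an assumption, not the paper's).
STOPWORDS = {"a", "an", "the", "you", "your", "is", "are", "of", "to", "and",
             "in", "on", "for", "this", "that", "because", "with", "it"}


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def semantic_similarity(e: str, e_prime: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    a, b = Counter(tokenize(e)), Counter(tokenize(e_prime))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def keyword_stability(e: str, e_prime: str) -> float:
    """Jaccard coefficient over non-stopword tokens (a proxy for key terms)."""
    keys_a = set(tokenize(e)) - STOPWORDS
    keys_b = set(tokenize(e_prime)) - STOPWORDS
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 1.0


def length_stability(e: str, e_prime: str) -> float:
    """Assumed form: shorter token count divided by longer token count."""
    len_a, len_b = len(tokenize(e)), len(tokenize(e_prime))
    longest = max(len_a, len_b)
    return min(len_a, len_b) / longest if longest else 1.0
```

Because bag-of-words cosine ignores word order and synonyms, it rewards lexical overlap rather than meaning in a deeper sense, which is one reason the framework pairs it with the other metrics.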