Evaluation Setup
Controlled experiments on a synthetic e-commerce dataset (200 items, 7 categories, 100 users)
Benchmarks:
- RobustExplain Framework (Explanation Generation under Perturbation) [New]
Metrics:
- Semantic Similarity (Sem)
- Keyword Stability (Key)
- Structural Consistency (Struct)
- Length Stability (Len)
- Weighted Robustness Score
Statistical methodology: Not explicitly reported in the paper
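As a rough illustration of how the metrics listed above might be combined into a single weighted score, here is a minimal Python sketch. The paper's exact metric definitions and weights are not given in this summary, so everything here is an assumption: `keyword_stability` uses Jaccard overlap of word sets, `length_stability` uses a min/max word-count ratio, and the equal weights are placeholders. Semantic Similarity and Structural Consistency are passed in as precomputed values because they would typically need an embedding model or a parser.

```python
import re


def keyword_stability(original: str, perturbed: str) -> float:
    """Jaccard overlap of lowercase word sets (assumed stand-in for Key)."""
    a = set(re.findall(r"[a-z0-9]+", original.lower()))
    b = set(re.findall(r"[a-z0-9]+", perturbed.lower()))
    if not a and not b:
        return 1.0  # two empty explanations are trivially identical
    return len(a & b) / len(a | b)


def length_stability(original: str, perturbed: str) -> float:
    """Ratio of shorter to longer word count (assumed stand-in for Len)."""
    la, lb = len(original.split()), len(perturbed.split())
    if max(la, lb) == 0:
        return 1.0
    return min(la, lb) / max(la, lb)


def weighted_robustness(scores: dict, weights: dict) -> float:
    """Weighted average of per-metric scores, normalized by total weight."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


# Hypothetical equal weights; the paper's actual weights are not stated here.
EQUAL_WEIGHTS = {"sem": 0.25, "key": 0.25, "struct": 0.25, "len": 0.25}
```

Given an original and a perturbed explanation, one would compute each metric in [0, 1] and pass the resulting dictionary to `weighted_robustness`; a score of 1.0 would mean the explanation was unchanged under every metric.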
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| RobustExplain (Average Consistency) | Robustness Score | 0.50 | 0.54 | +0.04 |
| RobustExplain | Stability Gain | Not reported in the paper | Not reported in the paper | +0.08 |
Main Takeaways
- Current LLMs exhibit only moderate robustness (scores around 0.50): explanations change substantially even under minor noise in the user history
- Robustness correlates positively with model size: 70B models are more stable than 7B models
- Models are sensitive to specific types of noise: 'Noise Injection' (random items) tends to disrupt explanations more than 'Temporal Shuffle'
- Metrics are complementary: Semantic similarity captures meaning, while keyword stability captures specific entity preservation
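To make the two perturbation types named above concrete, the following Python sketch shows one plausible way to apply them to a user history. It is an illustration under stated assumptions, not the paper's implementation: the function names, the list-of-item-IDs representation, and the choice to insert noise at random positions are all hypothetical.

```python
import random


def noise_injection(history, catalog, n_noise, rng):
    """Insert up to n_noise random catalog items the user never interacted with.

    Assumption: noise items land at random positions, and the relative order
    of the original interactions is preserved.
    """
    candidates = [item for item in catalog if item not in history]
    noisy = list(history)
    for item in rng.sample(candidates, min(n_noise, len(candidates))):
        noisy.insert(rng.randrange(len(noisy) + 1), item)
    return noisy


def temporal_shuffle(history, rng):
    """Return the same interactions in a random order (content unchanged)."""
    shuffled = list(history)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical usage with a seeded generator for reproducibility.
rng = random.Random(0)
history = ["i1", "i2", "i3"]
catalog = [f"i{k}" for k in range(1, 11)]
noisy = noise_injection(history, catalog, 2, rng)
shuffled = temporal_shuffle(history, rng)
```

Under this sketch, Noise Injection changes which items the model sees, whereas Temporal Shuffle changes only their order. That difference is one plausible reason the former would disrupt explanations more.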