Evaluation Setup
Nine conversational recommender systems (CRSs), including KBRD, BARCOR, and ChatGPT, are evaluated on the CRSArena-Dial dataset.
Benchmarks:
- CRSArena-Dial (Conversational Recommendation)
- Topical-Chat (Open-domain Chit-chat)
- PersonaChat (Persona-conditioned Chit-chat)
Metrics:
- Spearman correlation
- Pearson correlation
- Kendall's Tau
Statistical methodology: not explicitly reported in the paper.
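To make the three correlation metrics concrete, here is a minimal pure-Python sketch of how each is computed between human ratings and an automatic metric's scores. The `human` and `metric` lists are made-up illustrative values, not data from the paper, and the function names are hypothetical. The Kendall variant shown is tau-b, which is an assumption: the summary does not say which variant the paper uses.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, tie-corrected."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: counted in neither term
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + ties_x) * (pairs + ties_y))

# Hypothetical example: five dialogues, human rating vs. metric score.
human = [1, 2, 3, 4, 5]
metric = [0.2, 0.1, 0.5, 0.4, 0.9]

rho = spearman(human, metric)
tau = kendall_tau(human, metric)
r = pearson(human, metric)
print(f"Spearman={rho:.3f}  Kendall={tau:.3f}  Pearson={r:.3f}")
```

Spearman and Kendall depend only on the ordering of the scores, while Pearson also reflects how linearly the score magnitudes track the human ratings, which is why evaluations of automatic metrics commonly report more than one of them.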
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| CRSArena-Dial | Spearman correlation | 0.784 | 0.902 | +0.118 |
| CRSArena-Dial | Spearman correlation | 0.413 | 0.501 | +0.088 |
| CRSArena-Dial | Spearman correlation | 0.411 | 0.505 | +0.094 |
| Topical-Chat | Spearman correlation | 0.360 | 0.456 | +0.096 |
| PersonaChat | Spearman correlation | 0.407 | 0.510 | +0.103 |
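As a quick arithmetic check, the Δ column is the "This Paper" value minus the "Baseline" value. The sketch below recomputes it from the table rows. The relative gain it also prints is an added illustration, not a figure reported in the source, and `gains` is a hypothetical helper name.

```python
# (benchmark, baseline, this_paper) copied from the Key Results table.
rows = [
    ("CRSArena-Dial", 0.784, 0.902),
    ("CRSArena-Dial", 0.413, 0.501),
    ("CRSArena-Dial", 0.411, 0.505),
    ("Topical-Chat", 0.360, 0.456),
    ("PersonaChat", 0.407, 0.510),
]

def gains(baseline, proposed):
    """Return (absolute, relative) improvement, each rounded to 3 decimals."""
    absolute = proposed - baseline
    return round(absolute, 3), round(absolute / baseline, 3)

for name, base, ours in rows:
    delta, rel = gains(base, ours)
    print(f"{name:14s} delta={delta:+.3f}  relative={rel:+.1%}")
```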
Main Takeaways
- FACE consistently outperforms reference-based metrics (BLEU, ROUGE) and strong LLM-based baselines (G-Eval, GPTScore) across both turn-level and dialogue-level aspects.
- The method generalizes effectively to chit-chat domains (Topical-Chat, PersonaChat) despite being designed with CRS in mind.
- Ablation studies confirm that both particle decomposition and instruction optimization contribute significantly to performance.
- Qualitative analysis shows FACE provides interpretable insights, enabling the identification of specific issues like 'premature recommendations' or 'repetitive behavior' that single-score metrics miss.