Evaluation Setup
Multi-turn conversational recommendation simulation (max 5 turns)
Benchmarks:
- ReDial (Movie conversational recommendation)
- OpenDialKG (multi-domain conversational recommendation; movie subset used)
Metrics:
- Recall@1
- Recall@10
- Recall@50
- Success Rate per Turn
- Statistical methodology: Not explicitly reported in the paper
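As a rough illustration of the Recall@k metrics listed above, the sketch below computes Recall@k for a single ranked recommendation list and averages it over a set of conversations. The function names and the input layout (a ranked list of item names paired with a set of ground-truth items) are assumptions made for this example, not the paper's evaluation code.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked_items: Sequence[str], targets: Set[str], k: int) -> float:
    """Fraction of ground-truth items that appear in the top-k ranked items."""
    if not targets:
        return 0.0
    top_k = set(ranked_items[:k])
    return len(top_k & targets) / len(targets)


def mean_recall_at_k(
    predictions: List[Tuple[Sequence[str], Set[str]]], k: int
) -> float:
    """Average Recall@k over (ranked_items, targets) pairs (hypothetical layout)."""
    if not predictions:
        return 0.0
    scores = [recall_at_k(ranked, targets, k) for ranked, targets in predictions]
    return sum(scores) / len(scores)
```

Calling `mean_recall_at_k(predictions, k)` with `k` set to 1, 10, and 50 would give the three cut-offs used in this setup.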
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Impact of Data Leakage Removal on ReDial: Comparing performance when excluding conversations with leakage in history and responses ('-Both' setting). |
| ReDial | Recall@50 Drop | Not reported in the paper | Not reported in the paper | -21.6% |
| ReDial | Recall@50 Drop | Not reported in the paper | Not reported in the paper | -13.8% |
| ReDial | Recall@50 Drop | Not reported in the paper | Not reported in the paper | -13.5% |
| ReDial | Recall@50 Drop | Not reported in the paper | Not reported in the paper | -21.4% |
| Impact of Data Leakage Removal on OpenDialKG: Larger drops observed compared to ReDial. |
| OpenDialKG | Recall@50 Drop | Not reported in the paper | Not reported in the paper | -39.1% |
| OpenDialKG | Recall@50 Drop | Not reported in the paper | Not reported in the paper | -3.1% |
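To make the leakage-removal setting concrete, here is a minimal sketch of how conversations might be filtered before re-evaluation. It assumes that "leakage" means the target item's title appears verbatim in the conversation history or in the simulator's replies, and that the '-Both' setting drops a conversation when either source leaks. The dictionary keys (`history`, `responses`, `target`) and both function names are hypothetical; the paper's actual detection procedure may differ.

```python
from typing import Dict, List


def mentions_target(texts: List[str], target: str) -> bool:
    """True if any utterance contains the target title (case-insensitive match)."""
    needle = target.lower()
    return any(needle in text.lower() for text in texts)


def remove_leaky_conversations(
    conversations: List[Dict],
    check_history: bool = True,
    check_responses: bool = True,
) -> List[Dict]:
    """Keep only conversations with no detected leakage.

    With both flags enabled this approximates the '-Both' setting
    (assumption: a leak in either source excludes the conversation).
    """
    kept = []
    for conv in conversations:
        leaked_in_history = check_history and mentions_target(
            conv["history"], conv["target"]
        )
        leaked_in_responses = check_responses and mentions_target(
            conv["responses"], conv["target"]
        )
        if not (leaked_in_history or leaked_in_responses):
            kept.append(conv)
    return kept
```

Recall@50 would then be computed once on the full set and once on the filtered set, and the two values compared.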
Main Takeaways
- Data leakage in the conversation history and in simulator replies inflates evaluation results; removing it causes performance drops of over 20% for many models (Recall@50 falls by up to 21.6% on ReDial and 39.1% on OpenDialKG).
- Models are 'history-dependent': success rates are very high in the first turn (using only history) but drop significantly in turns 2-5 when relying on simulator interaction.
- ChatGPT demonstrates superior robustness compared to specialized CRS models (KBRD, BARCOR), showing much smaller performance degradation when leakage is removed.
- A significant portion of simulator interactions is 'chit-chat' rather than goal-oriented 'ask' or 'recommend' intents, which confuses the evaluation process.
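The history-dependence observation rests on the per-turn success rate, which can be sketched as follows. This example assumes each conversation is summarized by the turn at which the first successful recommendation occurred (or `None` if it never succeeded within the 5-turn cap), and that the rate for a turn is the share of all conversations that first succeed at that turn. Both the input layout and this definition are assumptions, since the summary does not spell out the formula.

```python
from collections import Counter
from typing import List, Optional

MAX_TURNS = 5  # the evaluation setup caps each simulation at 5 turns


def success_rate_per_turn(
    first_success_turns: List[Optional[int]], max_turns: int = MAX_TURNS
) -> List[float]:
    """Share of conversations whose first success falls on each turn (1-indexed)."""
    total = len(first_success_turns)
    if total == 0:
        return [0.0] * max_turns
    counts = Counter(turn for turn in first_success_turns if turn is not None)
    return [counts.get(turn, 0) / total for turn in range(1, max_turns + 1)]
```

A distribution concentrated on turn 1, with little mass on turns 2 to 5, would match the pattern described in the takeaways.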