Evaluation Setup
User simulation is run on the DuRecDial 2.0 dataset across four domains (Movies, Music, Food, POI). The evaluation covers both simulation fidelity and conversational recommender system (CRS) performance.
Benchmarks:
- DuRecDial 2.0 (Movies) (Conversational Recommendation)
- DuRecDial 2.0 (Music, Food, POI) (Conversational Recommendation)
Metrics:
- Personality Simulation Consistency (Precision/Recall/F1 of predicted vs. injected traits)
- Success Rate (SR)
- General Success Rate (GSR)
- Success Conversational Rounds (SCR)
- Persuasiveness (PRS)
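The personality consistency metric compares the traits injected into a simulated user with the traits predicted back from that user's dialogue. The sketch below illustrates one way to compute it, assuming traits are treated as label sets and F1 is averaged per simulated user. The function names `trait_prf` and `average_f1`, the trait labels, and the averaging scheme are illustrative assumptions, not the paper's implementation.

```python
from typing import Iterable, List, Tuple


def trait_prf(injected: Iterable[str], predicted: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one simulated user.

    Assumption: each trait is a plain string label and order does not matter.
    """
    injected_set, predicted_set = set(injected), set(predicted)
    true_positives = len(injected_set & predicted_set)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(injected_set) if injected_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def average_f1(pairs: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Mean per-user F1 over (injected, predicted) pairs (hypothetical averaging scheme)."""
    scores = [trait_prf(injected, predicted)[2] for injected, predicted in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: one of two predicted traits matches the injected ones.
p, r, f1 = trait_prf({"agreeable", "extroverted"}, {"agreeable", "neurotic"})
print(p, r, f1)
```

If the paper instead pools counts across all users before computing F1, the average would differ from the per-user mean shown here.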
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Simulation Consistency | Average F1 | 0.4823 | 0.7398 | +0.2575 |
| DuRecDial (Movies) | Success Rate (SR) | 0.4306 | 0.4856 | +0.0550 |
| DuRecDial (Movies) | General Success Rate (GSR) | 0.5865 | 0.7284 | +0.1419 |
| Human Evaluation | Pearson Correlation | N/A | Moderate to Strong | - |
Main Takeaways
- LLMs can effectively simulate specific personality traits in CRS users, with stronger models (GPT-4o, Llama-3) showing high consistency.
- User personality significantly impacts recommendation success; Agreeable and Extroverted users are easier to recommend to, while Neurotic users are harder to persuade.
- Persuasion strategies universally improve CRS performance, but their effectiveness varies by personality (e.g., Conscientious users respond better to Logic/Credibility than others).