Evaluation Setup
Simulated dialogue with a UserBot that holds ground-truth preferences and answers the system's queries
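A minimal sketch of such a simulated user, for illustration only. The class name `UserBot` follows the text, but the constructor arguments, the yes/no answer format, and the `noise` flip probability are assumptions rather than the paper's implementation.

```python
import random


class UserBot:
    """Hypothetical simulated user holding ground-truth preferences."""

    def __init__(self, liked_items, noise=0.0, seed=0):
        self.liked = set(liked_items)   # ground-truth preferences
        self.noise = noise              # assumed probability of a wrong answer
        self.rng = random.Random(seed)  # seeded for reproducible simulations

    def answer(self, item):
        """Return True if the user says they like `item`, else False."""
        truthful = item in self.liked
        # With probability `noise`, the simulated user answers incorrectly.
        if self.rng.random() < self.noise:
            return not truthful
        return truthful
```

With `noise=0.0` the bot always answers truthfully; raising `noise` gives one simple way to model a user who misunderstands a query.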
Benchmarks:
- MovieLens-25M (Movie Recommendation)
- Amazon Books (Book Recommendation)
Metrics:
- MRR@10 (Mean Reciprocal Rank)
- SR@10 (Success Rate)
- AT (Average Turns)
- Statistical methodology: results are averaged over 50-100 dialogue simulations; standard deviations are shown in plots, but significance tests are not reported in the text
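The three metrics can be computed from per-dialogue logs roughly as follows. This is a sketch using the standard definitions of reciprocal rank and success rate at a cutoff; the record layout (`ranking`, `target`, `turns`) and the function names are hypothetical, and the paper may define AT differently (for example, counting turns only up to the first success).

```python
def reciprocal_rank_at_k(ranked_items, target, k=10):
    """1/rank of the target if it appears in the top k, otherwise 0."""
    top_k = ranked_items[:k]
    if target in top_k:
        return 1.0 / (top_k.index(target) + 1)
    return 0.0


def summarize(dialogues, k=10):
    """Average the per-dialogue metrics over all simulated dialogues."""
    n = len(dialogues)
    rr = [reciprocal_rank_at_k(d["ranking"], d["target"], k) for d in dialogues]
    return {
        "MRR@k": sum(rr) / n,                             # mean reciprocal rank
        "SR@k": sum(r > 0 for r in rr) / n,               # share of dialogues with a top-k hit
        "AT": sum(d["turns"] for d in dialogues) / n,     # average number of turns
    }


# Two toy dialogues: one hit at rank 2, one miss.
logs = [
    {"ranking": ["a", "b", "c"], "target": "b", "turns": 4},
    {"ranking": ["x", "y"], "target": "z", "turns": 10},
]
print(summarize(logs))  # {'MRR@k': 0.25, 'SR@k': 0.5, 'AT': 7.0}
```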
Key Results
Comparison against monolithic LLMs shows PEBOL's superior efficiency in identifying preferences within limited dialogue turns.
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| MovieLens-25M | MRR@10 | 0.174 | 0.270 | +0.096 |
| Amazon Books | MRR@10 | 0.046 | 0.134 | +0.088 |
| MovieLens-25M | MRR@10 | 0.003 | 0.270 | +0.267 |
Main Takeaways
- Formal Bayesian strategies (Thompson Sampling/UCB) substantially outperform monolithic LLM reasoning in cold-start preference elicitation
- PEBOL is more robust than the baselines to user noise (simulated misunderstandings or vague answers)
- The approach scales better than context-stuffing methods because the LLM sees only a single targeted item description per turn
- Thompson Sampling (PEBOL-TS) generally outperforms UCB (PEBOL-UCB) in this conversational setting
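To make the difference between the two strategies concrete, here is a minimal sketch of Thompson Sampling and UCB item selection. It assumes each item's utility is tracked as a Beta(a, b) posterior and updated with a soft like/dislike signal in [0, 1]; that modelling choice, the exploration weight `c`, and all function names are illustrative assumptions, not details confirmed by the summary above.

```python
import math
import random


def thompson_pick(posteriors, rng):
    """Thompson Sampling: draw one utility per item, query the largest draw."""
    samples = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(samples)), key=samples.__getitem__)


def ucb_pick(posteriors, c=1.0):
    """UCB: query the item with the largest posterior mean + c * std."""
    def score(a, b):
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        return mean + c * math.sqrt(var)

    scores = [score(a, b) for a, b in posteriors]
    return max(range(len(scores)), key=scores.__getitem__)


def update(posteriors, item, liked):
    """Fold a response in [0, 1] (1 = liked) into the item's Beta posterior."""
    a, b = posteriors[item]
    posteriors[item] = (a + liked, b + (1.0 - liked))


# Three items: unexplored, probably liked, probably disliked.
beliefs = [(1.0, 1.0), (5.0, 1.0), (1.0, 5.0)]
item = thompson_pick(beliefs, random.Random(0))
update(beliefs, item, liked=1.0)
```

Thompson Sampling explores through the randomness of its draws, while UCB explores deterministically through the `c * std` bonus; a larger `c` pushes UCB toward items it knows little about.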