Evaluation Setup
Simulated conversations between Seller (SalesBot or Human) and ShopperBot across 6 complex product categories (e.g., TVs, vacuums).
Benchmarks:
- SalesOps Simulation (Conversational Recommendation) [New]
Metrics:
- Recommendation Accuracy (Rec)
- Informativeness (Inf_e: entailment with guide, Inf_q: user quiz score)
- Fluency (Flu_e: Likert 1-5, Flu_i: Human/Bot classification)
Statistical methodology: Not explicitly reported in the paper
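As a rough illustration of how Rec, Flu_e, and Flu_i might be aggregated over a batch of simulated conversations, here is a minimal sketch. The `Conversation` record, its field names, and the rule that a recommendation counts as correct when it falls in a set of acceptable products are all assumptions made for illustration; they are not taken from the paper. The informativeness metrics (Inf_e, Inf_q) are omitted because they depend on an entailment model and a user quiz.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    # All fields are hypothetical; the paper's actual data schema is not given here.
    recommended: str        # product the seller ended up recommending
    acceptable: frozenset   # assumed set of products that fit the shopper's needs
    fluency_likert: int     # Flu_e: Likert rating from 1 to 5
    judged_human: bool      # Flu_i: annotator guessed "Human" rather than "Bot"


def rec_accuracy(convs):
    """Fraction of conversations whose recommendation is in the acceptable set."""
    hits = sum(1 for c in convs if c.recommended in c.acceptable)
    return hits / len(convs)


def mean_fluency(convs):
    """Average Likert fluency rating (Flu_e)."""
    return sum(c.fluency_likert for c in convs) / len(convs)


def human_rate(convs):
    """Share of conversations where the seller was judged to be human (Flu_i)."""
    return sum(1 for c in convs if c.judged_human) / len(convs)


# Toy data, invented purely to exercise the functions.
convs = [
    Conversation("tv-a", frozenset({"tv-a", "tv-b"}), 5, True),
    Conversation("tv-c", frozenset({"tv-a"}), 4, False),
    Conversation("vac-1", frozenset({"vac-1"}), 5, False),
    Conversation("vac-2", frozenset({"vac-3"}), 2, True),
]
print(rec_accuracy(convs), mean_fluency(convs), human_rate(convs))
```

Keeping each metric as a separate function over the same record list makes it straightforward to score SalesBot and human sellers with identical code.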
Key Results
Ablation studies demonstrate the necessity of LLMs in the Response Generation module and the benefits of generative query formulation.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| SalesOps Simulation | Fluency (Flu_e) | 1.41 | 4.99 | +3.58 |
| SalesOps Simulation | Recommendation Accuracy (Rec) | 0.36 | 0.44 | +0.08 |

Human evaluation comparing SalesBot against 15 professional salespeople shows comparable fluency but higher recommendation accuracy for the human salespeople.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| SalesOps Simulation | Recommendation Accuracy (Rec) | 44 | 54 | +10 |
| SalesOps Simulation | Fluency Score (Flu_e) | 4.2 | 4.4 | +0.2 |
| SalesOps Simulation | Information Quiz Score (Inf_q) | 31.8 | 32.9 | +1.1 |
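The Δ column in the Key Results table is the difference between the This Paper and Baseline values, so it can be recomputed as a quick consistency check. The sketch below transcribes the five reported rows; the row labels, the `delta` and `check_rows` helpers, and the use of `Decimal` (to avoid floating-point rounding noise) are choices made for this illustration, not part of the paper.

```python
from decimal import Decimal

# (label, baseline, this_paper, reported_delta), transcribed from the table.
# Values stay as strings so Decimal parses them exactly.
ROWS = [
    ("Flu_e (ablation)", "1.41", "4.99", "+3.58"),
    ("Rec (ablation)", "0.36", "0.44", "+0.08"),
    ("Rec (human eval)", "44", "54", "+10"),
    ("Flu_e (human eval)", "4.2", "4.4", "+0.2"),
    ("Inf_q (human eval)", "31.8", "32.9", "+1.1"),
]


def delta(baseline: str, this_paper: str) -> Decimal:
    """Return this_paper minus baseline using exact decimal arithmetic."""
    return Decimal(this_paper) - Decimal(baseline)


def check_rows(rows):
    """Return the labels of rows whose reported delta does not match."""
    return [
        label
        for label, base, ours, reported in rows
        if delta(base, ours) != Decimal(reported)
    ]


mismatches = check_rows(ROWS)
print("mismatched rows:", mismatches)
```

An empty mismatch list means every reported Δ agrees with its two source columns.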
Main Takeaways
- SalesBot achieves high fluency and educational value comparable to professionals but struggles to close the gap in recommendation accuracy.
- Professional salespeople are more concise (about half the word count) and use casual language, leading to lower fluency scores but higher human-detection rates.
- Faithfulness is a challenge for both AI and humans; humans intentionally hallucinate (upsell) or guess to facilitate sales, complicating the definition of 'alignment' in sales domains.