Evaluation Setup
Human evaluation was run on Amazon Mechanical Turk with two groups: system users (who chatted with the bot) and third-party readers (who judged the reviews).
Benchmarks:
- User Satisfaction Survey (Human evaluation of system interaction) [New]
- Reader Helpfulness Evaluation (Pairwise comparison of generated vs. human reviews) [New]
Metrics:
- User rating of 'Fun'
- User rating of 'Burden'
- Amount of rewriting needed
- Reader preference (Helpful, Pros/Cons, Comprehensive)
- Statistical methodology: Mann–Whitney U test used for user enjoyment comparison (p < 0.05).
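To make the statistical comparison above concrete, here is a minimal sketch of a two-sided Mann-Whitney U test on two groups of ratings. It is written in pure Python, using the normal approximation with a tie correction. The rating lists, variable names, and the 5-point scale are hypothetical illustrations, not data from the paper, and the paper does not say which implementation or variant of the test it used.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    counts = Counter(pooled)

    # Tied values share the average of the 1-based ranks they occupy.
    first_rank = {}
    for i, value in enumerate(pooled):
        first_rank.setdefault(value, i + 1)
    avg_rank = {v: first_rank[v] + (counts[v] - 1) / 2 for v in counts}

    rank_sum_x = sum(avg_rank[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    # Variance of U under the null hypothesis, corrected for ties.
    n = n1 + n2
    tie_term = sum(t**3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u_x, 1.0  # all observations identical: no evidence of a shift

    # z-score with continuity correction, clipped at zero.
    z = max((abs(u_x - n1 * n2 / 2) - 0.5) / math.sqrt(variance), 0.0)
    p_two_sided = math.erfc(z / math.sqrt(2))
    return u_x, p_two_sided


# Hypothetical 'Fun' ratings on an assumed 1-5 scale (NOT the paper's data).
dynamic_fun = [5, 4, 5, 4, 5, 3, 5, 4, 4, 5]
fixed_fun = [3, 2, 4, 3, 3, 2, 4, 3, 2, 3]

u_stat, p_value = mann_whitney_u(dynamic_fun, fixed_fun)
print(f"U = {u_stat}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

The test is rank-based, so it suits ordinal ratings such as Likert scores, where a t-test's normality assumption is questionable. With small samples an exact p-value may be preferable to the normal approximation; `scipy.stats.mannwhitneyu` provides one.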
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Reader evaluations show the proposed system generates more helpful and comprehensive reviews than human writers or a fixed-question baseline. |
| Review Helpfulness | Win Rate vs Human | 23 | 55 | +32 |
| Review Comprehensiveness | Win Rate vs Human | 23 | 60 | +37 |
| Balanced Pros/Cons | Win Rate vs Human | 16 | 63 | +47 |
| User experience metrics reveal a trade-off: the dynamic system is more fun and produces better drafts, but is perceived as more burdensome due to latency. |
| Editing Effort | % of users needing >50% rewrite | 38 | 27 | -11 |
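The Δ column is the "This Paper" value minus the "Baseline" value for each row. The short sketch below recomputes it from the figures in the table above. It assumes all values are percentages, so each Δ is a difference in percentage points; the table states this explicitly only for the Editing Effort row. The variable names are illustrative.

```python
# (benchmark, metric, baseline, this_paper) as reported in the results table.
rows = [
    ("Review Helpfulness", "Win Rate vs Human", 23, 55),
    ("Review Comprehensiveness", "Win Rate vs Human", 23, 60),
    ("Balanced Pros/Cons", "Win Rate vs Human", 16, 63),
    ("Editing Effort", "% of users needing >50% rewrite", 38, 27),
]

# Delta = this paper minus baseline (assumed to be in percentage points).
deltas = {name: ours - baseline for name, _metric, baseline, ours in rows}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d}")
```

The sign means different things across rows. For the three win-rate rows a positive Δ favors the proposed system. For Editing Effort a negative Δ is the improvement, since fewer users needed to rewrite more than half of the draft.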
Main Takeaways
- Dynamic interviewing elicits more comprehensive information than fixed questionnaires, leading to reviews that readers find more helpful.
- Users find the interactive chat more 'fun' than filling out forms, but the latency of GPT-4 creates a perception of higher burden.
- Automated rating prediction based on the generated text aligns better with objective third-party assessments than with the users' own subjective ratings.
- System-generated reviews are consistently rated as more balanced (pros vs. cons) than human-written reviews.