Evaluation Setup
Synthetic dialog generation, followed by evaluation on a downstream recommendation task
Benchmarks:
- MobileRec (Mobile App Recommendation)
- Yelp (Local Business Recommendation)
- Amazon Electronics (Consumer Electronics Recommendation)
Metrics:
- Hit@K
- NDCG@K
- Human Evaluation (Naturalness, Coherence, Groundedness)
- Statistical methodology: Not explicitly reported in the paper
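
For reference, Hit@K and NDCG@K are commonly computed as sketched below when each test case has exactly one ground-truth item. This is a minimal sketch under that assumption, not the paper's evaluation code: the function names are illustrative, and the values of K and any candidate-sampling protocol are not given in this summary.

```python
import math

def hit_at_k(ranked_items, target, k):
    """Return 1.0 if the target appears among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@K for a single relevant item.

    With one relevant item the ideal DCG is 1, so the score reduces to
    1 / log2(rank + 1), where rank is the 1-based position of the target.
    """
    for idx, item in enumerate(ranked_items[:k]):
        if item == target:
            return 1.0 / math.log2(idx + 2)  # idx is 0-based, so rank + 1 == idx + 2
    return 0.0

def evaluate(rankings, targets, k):
    """Average Hit@K and NDCG@K over all test cases."""
    n = len(targets)
    hit = sum(hit_at_k(r, t, k) for r, t in zip(rankings, targets)) / n
    ndcg = sum(ndcg_at_k(r, t, k) for r, t in zip(rankings, targets)) / n
    return hit, ndcg
```

Hit@K only records whether the target was retrieved, while NDCG@K also rewards placing it higher in the list, which is why the two are usually reported together.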
Main Takeaways
- ConvRecStudio generated over 38,000 dialogs across three domains (MobileRec, Yelp, Amazon Electronics), demonstrating that the pipeline scales.
- A downstream cross-attention transformer model trained on this synthetic data consistently outperformed baselines (History-only, Dialog-only, Naive Fusion) across all datasets.
- The unified cross-attention model achieved a 10.9% improvement in Hit@1 on Yelp, highlighting the value of fusing collaborative history with conversational context.
- Human evaluators rated the synthetic dialogs as natural, coherent, and grounded, validating the effectiveness of the profile-driven, plan-constrained generation pipeline.
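
The idea of fusing interaction history with dialog context through cross-attention can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the paper's architecture: it uses a single head with no learned projections, treats the dialog representation as the query over history item embeddings (the actual direction of attention is not stated in this summary), and all names and vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over keys/values.

    Assumption: the query is a dialog embedding; keys and values are
    embeddings of items in the user's interaction history.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    fused = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return fused, weights

def rank_candidates(fused, candidates):
    """Rank candidate items (name -> embedding) by dot product with the fused vector."""
    return sorted(candidates, key=lambda name: dot(fused, candidates[name]), reverse=True)

# Toy usage with hypothetical 2-d embeddings.
dialog_vec = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = cross_attend(dialog_vec, history, history)
ranking = rank_candidates(fused, {"cafe": [1.0, 0.0], "bar": [0.0, 1.0]})
```

In this toy setup the history item most similar to the dialog embedding receives the largest attention weight, so the fused vector leans toward it. A History-only or Dialog-only baseline would use just one of the two inputs.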