Evaluation Setup
Re-ranking task on Amazon Review Data subsets (Movies, Games, CD, Electronics)
Benchmarks:
- Amazon Review Data (2018) (Top-K Recommendation (Re-ranking))
Metrics:
- HIT@K (H@K)
- MAP@K (Mean Average Precision)
- NDCG@K (Normalized Discounted Cumulative Gain)
- Statistical methodology: Average of three repeated runs reported with standard deviation
Main Takeaways
- Small domain gaps are essential: Transferring between 'CD & Vinyl' and 'Movies & TV' (same sub-group) yields massive gains (+64.28% MAP@1), while transferring from 'Electronics' to 'Movies' (large gap) hurts performance.
- Guidance matters: Explicitly prompting the LLM with 'recommendation guidance' (shared features like genre) improves metrics compared to using history alone.
- History length trade-off: Very short history (20 items) is insufficient for confident transfer, but increasing history helps target domain recommendation up to a point.
- Model scale correlation: Performance gains are positively correlated with model size; GPT-4 significantly outperforms GPT-3.5 and smaller local models (Ollama) in utilizing cross-domain context.