Evaluation Setup
Two settings, each evaluated on four datasets: ranking (simulated with candidate item sets) and re-ranking (refining the outputs of traditional recommendation models)
Benchmarks:
- Four datasets (not named here), used for recommendation in both the ranking and re-ranking settings
Metrics:
- Hit Ratio (HR)
- Normalized Discounted Cumulative Gain (NDCG)
- Average Percentage of Long Tail items (APLT), a measure of popularity bias
- Serendipity
- Candidate Position Bias (Eq 8)
- Hallucination Rate (measured via string matching)
- Statistical methodology: the Kolmogorov-Smirnov (K-S) test is used to check that the small test samples are representative
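As a rough illustration of how three of the metrics above can be computed, here is a minimal Python sketch. It assumes a single held-out target item per user (so the ideal DCG is 1) and treats a recommended title as hallucinated when it has no exact, case-insensitive match in the item catalog. The function names, the matching rule, and the toy data are assumptions made for illustration, not details taken from the paper.

```python
import math

def hit_ratio_at_k(ranked, target, k):
    """Return 1.0 if the held-out target appears in the top-k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """NDCG@k for one relevant item: 1 / log2(rank + 1), ideal DCG is 1."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def hallucination_rate(recommended, catalog):
    """Fraction of recommended titles with no exact match in the catalog.

    Matching is case-insensitive and ignores surrounding whitespace
    (an assumed rule; the paper's string matching may differ).
    """
    if not recommended:
        return 0.0
    known = {title.strip().lower() for title in catalog}
    missing = [t for t in recommended if t.strip().lower() not in known]
    return len(missing) / len(recommended)

# Toy data, invented for this example.
catalog = ["Alien", "Heat", "Fargo", "Se7en"]
ranked = ["Heat", "Alien 5", "Fargo"]  # "Alien 5" is not in the catalog

print(hit_ratio_at_k(ranked, "Fargo", k=3))   # target is at rank 3
print(ndcg_at_k(ranked, "Fargo", k=3))        # 1 / log2(4)
print(hallucination_rate(ranked, catalog))    # 1 of 3 titles is unknown
```

Averaging the per-user values over the test set gives the dataset-level HR@k and NDCG@k.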
Main Takeaways
- LLMs generally perform better in the re-ranking setting than in the ranking setting
- In the ranking setting, LLMs do best with short input histories (cold-start) and in domains where they have prior knowledge
- LLMs exhibit substantial candidate position bias, often favoring items at the start of the prompt regardless of relevance
- Hallucination is a significant issue, with some models fabricating non-existent items much more frequently than others
- LLM-generated textual profiles can capture key patterns in user history, potentially improving recommendation explainability
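The candidate position bias takeaway can be probed with a simple experiment: place the same target item at every slot of the candidate list and check whether it still reaches the top-k. The sketch below is one hypothetical way to do this and is not the paper's Eq 8. `recommend` stands in for a call to an LLM, and `first_listed_recommender` is a toy stub that echoes the prompt order, which is the most extreme form of the bias.

```python
def hit_rate_by_position(recommend, candidates, target, k=1):
    """Insert `target` at each slot and record whether it lands in the top-k."""
    others = [c for c in candidates if c != target]
    hits = []
    for pos in range(len(others) + 1):
        prompt_order = others[:pos] + [target] + others[pos:]
        ranked = recommend(prompt_order)
        hits.append(1.0 if target in ranked[:k] else 0.0)
    return hits

def first_listed_recommender(prompt_order):
    """Toy stand-in for an LLM: returns candidates in prompt order."""
    return list(prompt_order)

hits = hit_rate_by_position(
    first_listed_recommender, ["a", "b", "c", "d"], target="c", k=2
)
print(hits)  # [1.0, 1.0, 0.0, 0.0]
```

A recommender with no position bias would give a flat profile across slots, while a profile that falls off after the first few positions indicates that placement in the prompt, not relevance, is driving the ranking.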