Evaluation Setup
The evaluation covers conversational recommendation tasks, including preference elicitation, explanation, and direct recommendation.
Benchmarks:
- ELM 24 Tasks (conversational recommendation: explanation, critiquing, etc.)
- OpenP5 (Sequential and Straightforward Recommendation)
Metrics:
- Log Perplexity
- Semantic Consistency (SC)
- NDCG@10
- Hit Rate@10 (HR@10)
- Statistical methodology: Standard error reported
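As a rough illustration of the ranking metrics listed above, the sketch below computes HR@10 and NDCG@10 for the common case of one held-out target item per user, then reports the mean with its standard error. The function names and the single-target assumption are illustrative and are not taken from the paper's evaluation code.

```python
import math

def hit_rate_at_k(ranked_items, target, k=10):
    """Return 1.0 if the target appears in the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k assuming exactly one relevant item (assumption of this sketch).

    The ideal DCG is then 1, so the score is 1 / log2(rank + 1)
    for a 1-based rank inside the top k, and 0 otherwise.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1
    return 1.0 / math.log2(rank + 1)

def mean_and_standard_error(values):
    """Sample mean and standard error of the mean (sample std / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

# Hypothetical per-user rankings paired with each user's held-out item.
rankings = [
    (["i3", "i7", "i1"], "i7"),  # target ranked 2nd
    (["i5", "i2", "i9"], "i4"),  # target missing from the list
]
ndcg_scores = [ndcg_at_k(items, target) for items, target in rankings]
hr_scores = [hit_rate_at_k(items, target) for items, target in rankings]
print(mean_and_standard_error(ndcg_scores))
print(mean_and_standard_error(hr_scores))
```

Averaging per-user scores and reporting the standard error of that average is one conventional reading of "standard error reported"; the paper may instead compute it across runs or seeds.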
Key Results
Performance on ELM 24 Tasks (lower perplexity is better, higher SC is better). ILM outperforms the MLP baseline and random initialization.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ELM 24 Tasks | Log Perplexity | -1.854 | -1.728 | +0.126 |
| ELM 24 Tasks | Semantic Consistency (SC) | 0.781 | 0.796 | +0.015 |

Performance on OpenP5 Recommendation Tasks. ILM consistently beats baselines on ranking metrics.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| OpenP5 (MovieLens-1M) | NDCG@10 | 0.1983 | 0.2081 | +0.0098 |
| OpenP5 (Beauty) | NDCG@10 | 0.0612 | 0.0658 | +0.0046 |
| OpenP5 (Clothing) | NDCG@10 | 0.0573 | 0.0632 | +0.0059 |
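The Δ column is an absolute difference, which is hard to compare across datasets whose baselines differ in scale. The short sketch below recomputes Δ from the OpenP5 rows and adds the relative gain over the baseline. The helper name is illustrative, and the only inputs are the NDCG@10 values reported above.

```python
# NDCG@10 values copied from the OpenP5 rows: (baseline, this paper).
results = {
    "MovieLens-1M": (0.1983, 0.2081),
    "Beauty": (0.0612, 0.0658),
    "Clothing": (0.0573, 0.0632),
}

def absolute_and_relative_gain(baseline, ours):
    """Return (absolute delta, relative gain in percent over the baseline)."""
    delta = ours - baseline
    return round(delta, 4), round(100.0 * delta / baseline, 2)

for name, (baseline, ours) in results.items():
    delta, relative = absolute_and_relative_gain(baseline, ours)
    print(f"{name}: delta={delta:+.4f}, relative={relative:.2f}%")
```

In relative terms the gains are about 4.9% on MovieLens-1M, 7.5% on Beauty, and 10.3% on Clothing, so the smaller absolute deltas on the sparser datasets correspond to larger proportional improvements.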
Main Takeaways
- The Q-Former based item encoder (ILM) significantly outperforms simple MLP projections (CoLLM) for integrating collaborative filtering signals into LLMs.
- Pre-training with Item-Text alignment and Item-Item contrastive learning is crucial; ILM with random initialization (ILM-rand) performs worse than the full ILM.
- The method works across diverse tasks (conversational, sequential recommendation) and domains (movies, beauty, clothing), showing robust generalization.
- Freezing the LLM and only training the adapter effectively preserves the model's language capabilities while imparting recommendation knowledge.
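The item-item contrastive pre-training mentioned above can be illustrated with an InfoNCE-style loss using in-batch negatives. This summary does not state the exact objective the paper uses, so the loss form, the cosine similarity, and the temperature value below are all assumptions chosen only to show the general idea: embeddings of related items are pulled together while other items in the batch are pushed apart.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed objective, not the paper's).

    positives[i] is the positive for anchors[i]; every other row of
    `positives` acts as an in-batch negative for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings: the loss is small when each anchor matches its own positive
# and large when the pairs are swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(anchors, aligned))
print(info_nce(anchors, swapped))
```

A randomly initialized encoder such as ILM-rand skips this kind of alignment stage, which is consistent with the takeaway that it performs worse than the full ILM.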