Evaluation Setup
Sequential next-item prediction on sparse datasets
Benchmarks:
- Amazon Games (Sequential Recommendation)
- Amazon Toys (Sequential Recommendation)
Metrics:
- Ranking metrics: NDCG and Hit Ratio (implied by 'Top-K recommendations')
- Statistical methodology: Not explicitly reported in the paper
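As a rough illustration of the two ranking metrics named above, the sketch below computes Hit Ratio@K and NDCG@K for next-item prediction. It assumes a single ground-truth item per test sequence, which is the common convention in sequential recommendation. The function names and the choice of K are illustrative and are not taken from the paper.

```python
import math

def hit_ratio_at_k(ranked_items, target, k):
    """Return 1.0 if the held-out target item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with one relevant item: the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    position = top_k.index(target)        # 0-based position in the ranked list
    return 1.0 / math.log2(position + 2)  # 1-based rank r gives 1 / log2(r + 1)

def evaluate(rankings, targets, k):
    """Average both metrics over all test users (assumes parallel, non-empty lists)."""
    n = len(targets)
    hr = sum(hit_ratio_at_k(r, t, k) for r, t in zip(rankings, targets)) / n
    ndcg = sum(ndcg_at_k(r, t, k) for r, t in zip(rankings, targets)) / n
    return hr, ndcg
```

For example, `ndcg_at_k([3, 1, 2], target=1, k=3)` places the target at rank 2 and returns `1 / log2(3)`, roughly 0.631, while the corresponding Hit Ratio is 1.0.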
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Inference Latency (10k users) | Time | 3 hours | Seconds | Huge reduction |
Main Takeaways
- DLLM2Rec achieves an average performance improvement of 47.97% across three standard sequential models (SASRec, CL4SRec, DROS), enabling them to match or exceed LLM-based baselines.
- Directly using LLMs (Teacher) is not always superior; empirical analysis shows LLMs underperform conventional models in >30% of individual test cases, highlighting the risk of blind distillation.
- The 'semantic gap' is a major hurdle: teacher and student top-20 lists overlap by less than 3.15%, confirming they rely on fundamentally different signals (content vs. collaboration).
- Existing distillation methods (Hint, HTD) often degrade performance compared to the vanilla student model because they enforce alignment between incompatible semantic spaces.
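The top-20 overlap figure in the takeaways can be made concrete with a small sketch. It assumes overlap is measured as the fraction of items shared between the teacher's and the student's top-K lists, averaged over users. The summary does not state the paper's exact definition, so this formula and the function names are assumptions.

```python
def top_k_overlap(teacher_ranked, student_ranked, k=20):
    """Fraction of the top-k items that both ranked lists share (assumed definition)."""
    teacher_top = set(teacher_ranked[:k])
    student_top = set(student_ranked[:k])
    return len(teacher_top & student_top) / k

def mean_overlap(teacher_lists, student_lists, k=20):
    """Average the per-user overlap; both arguments map user id -> ranked item list."""
    users = teacher_lists.keys() & student_lists.keys()
    if not users:
        return 0.0
    return sum(
        top_k_overlap(teacher_lists[u], student_lists[u], k) for u in users
    ) / len(users)
```

A mean overlap near zero under this measure would indicate that the two models recommend almost disjoint item sets, which is the situation the 'semantic gap' takeaway describes.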