Distillation Matters: Empowering Sequential Recommenders to Match the Performance of Large Language Model

Yu Cui, Feng Liu, Pengbo Wang, Bohao Wang, Heng Tang, Yi Wan, Jun Wang, Jiawei Chen
Zhejiang University, OPPO Research Institute, University of Electronic Science and Technology of China
arXiv (2024)
Recommendation Reasoning

📝 Paper Summary

Sequential Recommendation, Knowledge Distillation, Large Language Models (LLMs) for Recommendation
DLLM2Rec distills knowledge from slow, semantic-rich LLM recommenders to fast conventional models using importance-aware ranking weights and collaborative embedding adaptation to handle noisy teacher signals and semantic gaps.
Core Problem
LLM-based recommenders have prohibitive inference latency (hours vs. seconds), but distilling their knowledge to faster conventional models fails due to unreliable teacher predictions, huge capacity gaps, and divergent semantic spaces.
Why it matters:
  • LLMs like LLaMA2-7B take ~3 hours to serve 10k users, making them unusable for real-time industrial recommendation requiring sub-second responses
  • Direct distillation is harmful because LLMs can hallucinate and underperform conventional models in >30% of cases, and because their content-based embeddings do not align with collaborative ID-based spaces
Concrete Example: A conventional model recommends items based on purchase history IDs, while an LLM generates item titles based on semantic descriptions. Because these spaces are disjoint, forcing the conventional model to match the LLM's embeddings directly destroys its ability to capture collaborative signals, often leading to worse performance than training without distillation.
Key Novelty
Uncertainty-Aware & Collaborative Distillation (DLLM2Rec)
  • Filters 'bad' teacher knowledge by weighting distillation samples based on Teacher Confidence (does the LLM's generated text match the item?) and Teacher-Student Consistency (do they agree?)
  • Bridges the semantic gap not by forcing alignment, but by projecting teacher embeddings and adding a learnable 'collaborative offset' that preserves the student's ability to learn ID-based patterns
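The two ideas above can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the function names, the softmax over negative text-to-item distances, the mixing coefficients `lam_conf` and `lam_cons`, and the plain linear projection are all assumptions chosen only to make the mechanism concrete.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def importance_weights(text_item_distance, in_student_top_k,
                       lam_conf=0.5, lam_cons=0.5):
    """Per-item distillation weights (hypothetical formulation).

    text_item_distance: distance between the LLM's generated text and the
        item it was matched to (smaller means the teacher is more confident).
    in_student_top_k: True where the student also ranks the item highly.
    """
    # Teacher confidence: smaller distance gives a larger weight.
    confidence = softmax([-d for d in text_item_distance])
    # Teacher-student consistency: 1 when both models agree on the item.
    consistency = [1.0 if hit else 0.0 for hit in in_student_top_k]
    return [lam_conf * c + lam_cons * s
            for c, s in zip(confidence, consistency)]


def weighted_distillation_loss(student_scores, weights):
    """Weighted negative log-sigmoid of the student's scores on the
    teacher-recommended items; low-weight items contribute little."""
    # -log(sigmoid(s)) == log(1 + exp(-s))
    return sum(w * math.log1p(math.exp(-s))
               for s, w in zip(student_scores, weights))


def student_item_embedding(teacher_emb, projection, offset):
    """Project a teacher embedding into the student's space, then add a
    learnable offset (assumed here to be a plain vector per item)."""
    projected = [sum(p * t for p, t in zip(row, teacher_emb))
                 for row in projection]
    return [p + o for p, o in zip(projected, offset)]


# Toy usage with made-up numbers.
weights = importance_weights([0.1, 0.9, 0.4], [True, False, True])
loss = weighted_distillation_loss([2.0, -1.0, 0.5], weights)
emb = student_item_embedding([1.0, 2.0, 3.0],
                             [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                             [0.5, -0.5])
print(emb)  # [1.5, 1.5]
```

In this toy run the second item gets the smallest weight, because the teacher's text is far from the item and the student does not rank it highly, so it barely influences the loss. In a real model the projection and offset would be trained jointly with the student rather than fixed by hand.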
Evaluation Highlights
  • Improves three typical sequential models (SASRec, CL4SRec, DROS) by 47.97% on average when they are trained as students with DLLM2Rec
  • Reduces inference latency from ~3 hours (LLaMA2-7B teacher) to seconds (student model), maintaining the speed of conventional recommenders
  • Identifies that LLM teachers underperform conventional baselines in >30% of cases, validating the need for the proposed importance-aware filtering
Breakthrough Assessment
7/10
Strong practical motivation addressing the critical bottleneck of LLM deployment in RecSys. The proposed selective distillation and collaborative offset offer a nuanced solution to the 'semantic gap' problem, though the core novelty is an assembly of known distillation techniques adapted for this domain.