Can Small Language Models be Good Reasoners for Sequential Recommendation?

📝 Paper Summary

Sequential Recommendation, LLM for Recommendation, Knowledge Distillation

SLIM distills step-by-step reasoning capabilities from a large teacher LLM into a smaller student model, generating dense rationale vectors that enhance traditional sequential recommenders.

Core Problem

Directly using Large Language Models (LLMs) for sequential recommendation is computationally expensive and high-latency, while traditional models lack the open-world reasoning capabilities to understand complex user behaviors.

Why it matters:

Real-world recommender systems require low-latency inference, making massive models like GPT-4 impractical to deploy directly for every user request
Traditional sequential models suffer from closed-loop data limitations (exposure bias), missing the broader context and reasoning ability inherent in LLMs
Existing methods that use LLMs as rankers or knowledge enhancers often ignore the intermediate reasoning steps that explain *why* a user might prefer an item

Concrete Example: A user's history might include several strategy games. A traditional model sees only item IDs. A large LLM can reason 'User likes strategy -> recommend Civilization VI', but is too costly to run for every request. SLIM distills this reasoning process so that a small model can generate a rationale such as 'User enjoys historical strategy games', which is then encoded as a dense vector to guide the recommender.

Key Novelty

Step-by-step knowLedge dIstillation fraMework (SLIM)

Uses Chain-of-Thought (CoT) prompting on a teacher LLM to generate macro-to-micro rationales (User Preference -> Category Interest -> Specific Items)
Distills this reasoning process into a smaller student model (LLaMA2-7B) by using the teacher's rationales as training labels, enabling the student to 'think' like the teacher
Encodes the student's generated text rationales into dense vectors that are fused with traditional ID-based or ID-agnostic recommendation backbones
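
The macro-to-micro prompting described above could be sketched as follows. This is only an illustration: the prompt wording, the function name `build_cot_prompt`, and the example titles are assumptions, not the paper's actual template.

```python
def build_cot_prompt(history_titles):
    """Build a macro-to-micro CoT prompt from a user's item titles.

    The wording is a hypothetical stand-in for the paper's template.
    """
    # Number the items so the model sees the interaction order
    history = "\n".join(
        f"{i}. {title}" for i, title in enumerate(history_titles, 1)
    )
    # Three stages: user preference -> category interest -> specific items
    steps = [
        "Step 1. Summarize the user's overall preferences.",
        "Step 2. Name the item categories the user is likely interested in.",
        "Step 3. Suggest specific items the user may want next.",
    ]
    return (
        "The user has interacted with these items in order:\n"
        f"{history}\n\n"
        "Reason step by step:\n" + "\n".join(steps)
    )


prompt = build_cot_prompt(["Civilization V", "Age of Empires II"])
print(prompt)
```

The same prompt would be sent to the teacher when collecting labels and to the student at inference time, so that both see identical inputs.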

Evaluation Highlights

Outperforms state-of-the-art baselines on three real-world datasets (Amazon Beauty, Sports, Toys), with significant gains in ID-agnostic settings
Student model (LLaMA2-7B) achieves reasoning capabilities comparable to teacher models roughly 25x its size, i.e., with only about 4% of the parameters
Generates meaningful natural language rationales that improve interpretability without the high inference cost of massive LLMs

Breakthrough Assessment

7/10

Novel application of CoT distillation specifically for sequential recommendation. Effectively bridges the gap between high-reasoning LLMs and efficiency-focused recommender systems, though the core technique is a standard distillation application.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation, i.e., predicting the next item a user will interact with based on their historical sequence of interactions.

Inputs: User interaction sequence S_u = [i_1, i_2, ..., i_{t-1}]

Outputs: Predicted probability scores for candidate items
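
As a rough illustration of the input/output contract above, the sketch below turns a sequence representation into probability scores over candidates with a dot product and a softmax. The scoring rule, the two-dimensional vectors, and the names `next_item_probs` and `seq_repr` are assumptions for illustration; the summary does not state how the backbone scores items.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_item_probs(seq_repr, candidate_embs):
    """Score each candidate by dot product with the sequence representation."""
    logits = [
        sum(a * b for a, b in zip(seq_repr, emb)) for emb in candidate_embs
    ]
    return softmax(logits)


# Hypothetical encoding of S_u = [i_1, ..., i_{t-1}] and three candidates
seq_repr = [0.5, 1.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = next_item_probs(seq_repr, candidates)
print(probs)
```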

Pipeline Flow

Teacher Reasoning: Teacher LLM (a much larger model) generates CoT rationales from user histories
Distillation: Student LLM (LLaMA2-7B) is fine-tuned to generate these rationales
Inference: Student LLM generates rationale for user history
Encoding: Text encoder converts rationale into dense vector
Fusion: Rationale vector is combined with item/sequence embeddings in the recommendation backbone
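
The inference-time part of this flow (the last three steps) might be wired together as below. This is a sketch under stated assumptions: `run_slim_inference` and the three callables are hypothetical interfaces, and the toy stand-ins only mimic the shape of a student LLM, a text encoder, and a backbone. Teacher reasoning and distillation happen offline and are not shown.

```python
from typing import Callable, List


def run_slim_inference(
    history: List[str],
    student_generate: Callable[[str], str],
    encode_text: Callable[[str], List[float]],
    backbone_score: Callable[[List[str], List[float]], List[float]],
) -> List[float]:
    """Inference -> Encoding -> Fusion, with each stage passed in as a callable."""
    prompt = "User history: " + "; ".join(history)
    rationale = student_generate(prompt)            # student LLM writes a rationale
    rationale_vec = encode_text(rationale)          # text encoder -> dense vector
    return backbone_score(history, rationale_vec)   # backbone fuses and scores


# Toy stand-ins (not real models) so the sketch runs end to end
def toy_student(prompt: str) -> str:
    return "User enjoys historical strategy games"


def toy_encoder(text: str) -> List[float]:
    return [float(len(text)), float(text.count(" "))]


def toy_backbone(history: List[str], vec: List[float]) -> List[float]:
    return [len(history) + vec[1], vec[0]]


scores = run_slim_inference(
    ["Civilization V", "Age of Empires II"], toy_student, toy_encoder, toy_backbone
)
print(scores)
```

Passing the stages in as callables keeps the expensive student generation separable from the backbone, so the two can be swapped or cached independently.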

System Modules

Teacher LLM

Generate high-quality reasoning rationales (CoT) based on user behavior sequences

Model or implementation: Not explicitly specified (implied to be GPT-3.5/4 or large LLaMA)

Student LLM

Generate reasoning rationales for user sequences during inference

Model or implementation: LLaMA2-7B

Text Encoder

Encode generated rationales and item descriptions into dense vectors

Model or implementation: BERT (or similar PLM)

Recommendation Backbone

Predict next item interaction probability

Model or implementation: SASRec or BERT4Rec

Novel Architectural Elements

Integration of distilled CoT rationale vectors into traditional sequential recommendation backbones via a dedicated fusion layer
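
The summary does not say how the fusion layer works. One simple possibility, shown purely as an assumption, is to project the rationale vector into the backbone's embedding space and add it to the sequence representation. The names `fuse` and `proj_weight` and all numbers below are hypothetical.

```python
def linear(vec, weight):
    """Multiply a vector by an (out_dim x in_dim) weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def fuse(seq_repr, rationale_vec, proj_weight):
    """Assumed fusion: project the rationale vector, then add element-wise."""
    projected = linear(rationale_vec, proj_weight)
    return [s + p for s, p in zip(seq_repr, projected)]


seq_repr = [0.2, -0.1]                  # backbone sequence representation (dim 2)
rationale_vec = [1.0, 0.0, 2.0]         # encoded rationale (dim 3)
proj_weight = [[0.5, 0.0, 0.25],        # maps dim 3 -> dim 2
               [0.0, 1.0, -0.5]]
fused = fuse(seq_repr, rationale_vec, proj_weight)
print(fused)
```

Other choices, such as concatenation followed by an MLP or a gating mechanism, would fit the same description.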

Modeling

Base Model: LLaMA2-7B (Student)

Training Method: Supervised Fine-Tuning (Distillation)

Objective Functions:

Purpose: Minimize the difference between student generation and teacher-generated rationales.

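Formally: Negative log-likelihood of conditional language modeling: $\mathcal{L} = -\sum_{t} \log P(r'_{u,t} \mid r'_{u,<t}, p'_u)$, where $r'_{u,t}$ is the $t$-th token of the rationale for user $u$ and $p'_u$ is the corresponding prompt.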
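
To make the objective concrete, the sketch below evaluates the negative log-likelihood for one rationale, given the probability the student assigns to each teacher token. The three probabilities are made-up values, and `rationale_nll` is a hypothetical helper name.

```python
import math


def rationale_nll(token_probs):
    """Sum of -log P(token_t | previous tokens, prompt) over the rationale."""
    return -sum(math.log(p) for p in token_probs)


# Hypothetical per-token probabilities of the teacher's rationale tokens
probs = [0.5, 0.25, 0.8]
loss = rationale_nll(probs)
print(round(loss, 4))
```

Because the product of these probabilities is 0.1, the loss equals ln(10); higher token probabilities drive the loss toward zero.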

Adaptation: LoRA (Low-Rank Adaptation)
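
A quick count shows why LoRA keeps fine-tuning cheap: for a weight matrix of shape d_out x d_in, the low-rank factors add only rank x (d_out + d_in) trainable parameters. The 4096 x 4096 shape matches LLaMA2-7B's hidden size; the rank of 8 is an assumed value, since the summary does not report the rank used.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full matrix params, LoRA trainable params) for one weight matrix."""
    full = d_out * d_in
    # LoRA adds B (d_out x rank) and A (rank x d_in); the base weight stays frozen
    lora = rank * (d_out + d_in)
    return full, lora


full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{lora / full:.2%}")
```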

Training Data:

Subset of users U' sampled from total users U
Rationales generated by Teacher LLM for U' used as labels
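
Building the distillation set from the two items above might look like the following sketch. The helper `build_distillation_set`, the prompt format, and `toy_teacher` (a stand-in for a real teacher LLM call) are all assumptions for illustration.

```python
import random


def build_distillation_set(histories, teacher_rationale, sample_size, seed=0):
    """Sample a user subset U' and pair each prompt with a teacher rationale."""
    rng = random.Random(seed)
    # sorted() keeps sampling reproducible regardless of dict ordering
    subset = rng.sample(sorted(histories), sample_size)
    examples = []
    for user in subset:
        prompt = "History: " + "; ".join(histories[user])
        label = teacher_rationale(histories[user])  # teacher output used as label
        examples.append({"user": user, "prompt": prompt, "label": label})
    return examples


def toy_teacher(history):
    # Stand-in for a real teacher LLM call
    return f"Rationale for a user with {len(history)} interactions."


histories = {"u1": ["a", "b"], "u2": ["c"], "u3": ["d", "e", "f"]}
train_set = build_distillation_set(histories, toy_teacher, sample_size=2)
print(train_set)
```

Sampling only a subset U' limits the number of teacher calls, which is where most of the distillation cost lies.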

Key Hyperparameters:

student_parameters: roughly 4% of the teacher's parameter count (implied by the 25x size comparison)
inference_cost_reduction: High (qualitative; due to smaller model size)

Compute: Student model is LLaMA2-7B (approx. 14 GB of GPU memory for the fp16 weights alone, far less than 175B-parameter models)
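
The memory figure follows from simple arithmetic: fp16 stores each parameter in 2 bytes. The sketch below covers weights only; activations and the KV cache add to this at inference time. The helper name `fp16_weight_gb` is hypothetical, and 175B is the reference size mentioned above.

```python
def fp16_weight_gb(num_params):
    """Weight memory in GB (10^9 bytes) at 2 bytes per parameter."""
    return num_params * 2 / 1e9


student_gb = fp16_weight_gb(7e9)     # 7B-parameter student
teacher_gb = fp16_weight_gb(175e9)   # 175B-parameter reference model
size_ratio = 7e9 / 175e9             # student size as a fraction of the reference

print(student_gb, teacher_gb, f"{size_ratio:.0%}")
```

The same ratio underlies the "4% of the parameters" and "25x its size" figures quoted earlier.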

Comparison to Prior Work

vs. TALLRec: SLIM uses LLM to generate intermediate rationales for a backbone, rather than using LLM as the final ranker
vs. LLM-Rec: SLIM distills reasoning into a smaller model for efficiency, rather than relying on large LLM API calls or heavy inference
vs. SASRec/BERT4Rec: SLIM injects open-world reasoning knowledge (rationales) into the closed-loop collaborative filtering process

Limitations

Dependency on the quality of teacher LLM rationales (hallucination risk)
Added latency compared to pure ID-based models due to text generation and encoding (though faster than large LLMs)
Requires maintaining two models (Reasoning LLM + Recommendation Backbone)

Reproducibility

Code is not provided. The paper describes using LLaMA2-7B and public datasets (Amazon Beauty, Sports, Toys). Specific prompt templates are provided in figures.

📊 Experiments & Results

Evaluation Setup

Sequential Recommendation on Amazon datasets

Benchmarks:

Amazon Beauty (Sequential Recommendation)
Amazon Sports (Sequential Recommendation)
Amazon Toys (Sequential Recommendation)

Metrics:

Hit@10
NDCG@10
Statistical methodology: Not explicitly reported in the paper
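
For reference, both metrics can be computed from the rank of the held-out item in each user's ranked list. The sketch assumes a single ground-truth item per user, a common protocol in sequential recommendation that the summary does not confirm; the ranks below are made-up values.

```python
import math


def hit_at_k(rank, k=10):
    """1 if the ground-truth item is ranked within the top k, else 0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k=10):
    """With one relevant item, NDCG reduces to 1 / log2(rank + 1) inside the top k."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


# Hypothetical 1-based ranks of the true next item for three users
ranks = [1, 3, 12]
hit = sum(hit_at_k(r) for r in ranks) / len(ranks)
ndcg = sum(ndcg_at_k(r) for r in ranks) / len(ranks)
print(hit, ndcg)
```

Hit@10 only checks whether the item appears in the top 10, while NDCG@10 also rewards placing it higher.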

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance comparison in ID-based setting showing SLIM enhances traditional backbones.
Amazon Sports	NDCG@10	0.0246	0.0267	+0.0021
Amazon Beauty	NDCG@10	0.0435	0.0461	+0.0026
Performance in ID-agnostic (transductive/inductive) settings, demonstrating strong generalization.
Amazon Toys	NDCG@10	0.0384	0.0441	+0.0057
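
The Δ column is the absolute difference between the two scores. The short script below recomputes it from the table and also derives the relative gain over the baseline; the relative figures are derived here and are not reported in the table itself.

```python
rows = [
    ("Amazon Sports", 0.0246, 0.0267),
    ("Amazon Beauty", 0.0435, 0.0461),
    ("Amazon Toys", 0.0384, 0.0441),
]

results = []
for name, baseline, slim in rows:
    delta = slim - baseline        # absolute gain, as in the table
    relative = delta / baseline    # gain as a fraction of the baseline
    results.append((name, delta, relative))
    print(f"{name}: delta={delta:+.4f}, relative={relative:.1%}")
```

On these numbers, the ID-agnostic row (Amazon Toys) shows the largest absolute and relative gain, in line with the takeaway that ID-agnostic settings benefit most.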

Main Takeaways

SLIM consistently improves performance over baselines in both ID-based and ID-agnostic settings, validating the utility of distilled rationales.
The generated rationales provide effective open-world knowledge that complements the collaborative signals in traditional datasets.
Student model (LLaMA2-7B) successfully learns the step-by-step reasoning pattern of the teacher, generating high-quality text rationales.
ID-agnostic performance is particularly boosted, suggesting the natural language rationales help bridge the gap when specific item IDs are less informative or absent.

📚 Prerequisite Knowledge

Prerequisites

Sequential Recommendation (SASRec, BERT4Rec)
Chain-of-Thought (CoT) Prompting
Knowledge Distillation
Low-Rank Adaptation (LoRA)

Key Terms

CoT: Chain-of-Thought—a prompting strategy that encourages LLMs to generate intermediate reasoning steps before the final answer

Knowledge Distillation: A process where a smaller 'student' model is trained to reproduce the behavior or outputs of a larger 'teacher' model

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained model weights and injects trainable rank decomposition matrices

ID-based Recommendation: Recommender systems that rely primarily on unique item identifiers (IDs) and learn embeddings for them

ID-agnostic Recommendation: Recommender systems that rely on item features (like text descriptions) rather than specific IDs, allowing for better generalization to new items

ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics used to evaluate automatic summarization and machine translation

BLEU: Bilingual Evaluation Understudy—a metric for evaluating the quality of text which has been machine-translated from one natural language to another