Pre-train, Align, and Disentangle: Empowering Sequential Recommendation with Large Language Models

📝 Paper Summary

Sequential Recommendation, LLM for Recommendation, Multi-modal Alignment

PAD enhances sequential recommendation by aligning LLM semantic embeddings with ID-based collaborative embeddings using characteristic kernels, then disentangling them via a frequency-aware triple-expert architecture.

Core Problem

Existing LLM-enhanced recommenders suffer from high inference latency, fail to capture full data distribution statistics due to non-characteristic alignment kernels, and experience catastrophic forgetting where alignment degrades collaborative knowledge.

Why it matters:

Commercial systems require low latency (hundreds of items in milliseconds), making direct LLM inference impractical
Standard contrastive alignment (e.g., InfoNCE with cosine kernels) misses higher-order statistical dependencies between modalities
Simply forcing text embeddings to match ID embeddings destroys the unique semantic information LLMs provide, hurting performance on cold items

Concrete Example: When a new item appears (cold start), ID-based models have no interaction history and fail. Standard alignment methods try to map its text description to the ID space but lose the semantic nuance or overwrite the ID model's collaborative patterns, leading to suboptimal recommendations for both fresh and popular items.

Key Novelty

Pre-train, Align, and Disentangle (PAD) Framework

Uses Multi-Kernel Maximum Mean Discrepancy (MK-MMD) with characteristic (Gaussian) kernels to align the text and ID embedding distributions; because the kernels are characteristic, an MMD of zero implies the two distributions are identical, so all statistical moments are matched, which cosine-based alignment does not guarantee
Introduces a 'rec-anchored' alignment loss that keeps the ID model anchored to the recommendation task during alignment (the recommendation loss is optimized jointly with the MMD term), preventing the text alignment from corrupting collaborative knowledge (catastrophic forgetting)
Deploys a Triple-Expert Mixture-of-Experts (MoE) at inference: one expert for ID features, one for text features, and one for aligned features, gated by item frequency to handle cold vs. popular items dynamically

Architecture

The three-phase PAD framework: (1) Pre-training SASRec and LLM, (2) Alignment via MK-MMD and Rec Loss, (3) Disentangled Triple-Expert Fine-tuning with Gating.

Evaluation Highlights

Outperforms state-of-the-art baselines (e.g., CTRL, RLMRec) by up to 8.54% on HR@10 across three datasets (Sports, Beauty, Toys)
Significantly improves cold-start performance, boosting NDCG@10 by ~6-13% on the Beauty dataset compared to the strongest baseline
Largely avoids catastrophic forgetting: ID-only performance drops by only ~1% after alignment, compared with ~35% drops for contrastive approaches such as CLIP

Breakthrough Assessment

7/10

Solid methodological improvement addressing specific weaknesses in LLM-Rec alignment (forgetting, distribution matching). The triple-expert design is a practical solution for the cold-start/warm-start trade-off.

⚙️ Technical Details

Problem Definition

Setting: Sequential Recommendation (SR) with multi-modal inputs (ID sequences + Text descriptions)

Inputs: User interaction sequence S = {i_1, ..., i_l} containing item IDs and associated text descriptions

Outputs: Probability distribution over the next item i_{l+1}

Pipeline Flow

Pre-training Phase: Train SASRec on IDs and LLM2Vec on Text separately
Alignment Phase: Train projection mapping from Text to ID space using MK-MMD + BCE loss
Disentanglement Phase (Fine-tuning and Inference): Fuse ID expert, Text expert, and Alignment expert via Frequency-Aware Gating

System Modules

Text Encoder (Input Processing)

Extract semantic embeddings from item text

Model or implementation: LLM2Vec (based on Llama-3-8B)

ID Encoder (Input Processing)

Extract collaborative embeddings from item IDs

Model or implementation: SASRec

Alignment Expert

Bridge semantic and collaborative spaces

Model or implementation: MLP Projector

Frequency-Aware Gating

Dynamically weight experts based on item popularity

Model or implementation: Learnable Gating Network

Novel Architectural Elements

Triple-expert architecture (ID-specific, Text-specific, Alignment-specific) with disentangled embedding tables
Frequency-aware gating mechanism explicitly conditioning expert selection on item popularity buckets
Recommendation-anchored characteristic alignment module using MK-MMD combined with a supervised recommendation loss
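
The triple-expert fusion with frequency-aware gating described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the bucket edges, the per-bucket logit table standing in for the learnable gating network, and the names `frequency_bucket` and `fuse_experts` are all hypothetical, and the three expert embeddings are assumed to be already projected to a common dimension.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def frequency_bucket(counts, edges=(5, 20, 100)):
    """Map interaction counts to popularity bucket ids 0..len(edges).

    The edges are illustrative assumptions, not values from the paper.
    """
    return np.digitize(counts, edges)

def fuse_experts(id_emb, text_emb, align_emb, buckets, gate_logits):
    """Blend three expert embeddings with bucket-conditioned weights.

    id_emb, text_emb, align_emb: arrays of shape (n_items, d).
    buckets: integer bucket id per item, shape (n_items,).
    gate_logits: (num_buckets, 3) table of logits, a simplified stand-in
        for a learnable gating network; columns are (ID, text, aligned).
    """
    weights = softmax(gate_logits[buckets], axis=-1)           # (n_items, 3)
    experts = np.stack([id_emb, text_emb, align_emb], axis=1)  # (n_items, 3, d)
    return (weights[:, :, None] * experts).sum(axis=1)         # (n_items, d)
```

With a logit table that favors the text column for the lowest bucket, cold items lean on the text expert while items in other buckets keep their own mixture, which is the behavior the gating is meant to provide.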

Modeling

Base Model: Llama-3-8B (for text encoding) + SASRec (for SR backbone)

Training Method: Three-stage training: (1) Pre-train individual modalities, (2) Align via MK-MMD, (3) Fine-tune MoE with Rec loss

Objective Functions:

Purpose: Measure the distance between the text and ID embedding distributions in the RKHS in order to align them.

Formally: L_MMD = || mu_p - mu_q ||^2_{H_k}, where mu_p and mu_q are the kernel mean embeddings of the two distributions (text and ID) in the RKHS H_k

Purpose: Maintain recommendation performance during alignment to prevent forgetting.

Formally: L_Rec = BCE(Prediction, Label)

Purpose: Overall Alignment Loss.

Formally: L = L_Rec + gamma * L_MMD
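
For readers who want the MMD term in computable form, the squared RKHS distance above expands into kernel expectations via the standard kernel-trick identity. The multi-kernel form below is a sketch under assumptions: the weights beta_u, the bandwidths sigma_u, and the exact Gaussian parameterization are illustrative and not confirmed by this summary.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{MMD}}
  &= \lVert \mu_p - \mu_q \rVert^2_{\mathcal{H}_k} \\
  &= \mathbb{E}_{x, x' \sim p}\,[k(x, x')]
   - 2\,\mathbb{E}_{x \sim p,\, y \sim q}\,[k(x, y)]
   + \mathbb{E}_{y, y' \sim q}\,[k(y, y')], \\
k(x, y) &= \sum_{u=1}^{U} \beta_u
   \exp\!\left(-\frac{\lVert x - y \rVert^2}{2\sigma_u^2}\right),
\qquad \beta_u \ge 0,\quad \sum_{u=1}^{U} \beta_u = 1 .
\end{aligned}
```

Each expectation can be estimated by averaging kernel values over a mini-batch, which is what makes the term usable as a training loss.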

Key Hyperparameters:

gamma: 0.1
batch_size: 2048
learning_rate: 0.001
hidden_size: 64 (SASRec), 4096 (Llama-3)
max_sequence_length: 50
mmd_kernels: 5 Gaussian kernels with bandwidths [1, 2, 4, 8, 16]
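
As a rough illustration of how the listed kernels and gamma could enter the alignment objective, here is a minimal NumPy sketch of a biased MK-MMD estimate. It assumes equal kernel weights and the parameterization exp(-||x - y||^2 / (2 * bandwidth^2)); whether the paper's bandwidths follow this convention is not stated here, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth):
    """Kernel matrix with entries exp(-||x_i - y_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mk_mmd(x, y, bandwidths=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Biased estimate of squared MK-MMD between samples x and y.

    Kernels are weighted equally, which is an assumption of this sketch.
    x: (n, d) samples from one distribution, y: (m, d) from the other.
    """
    total = 0.0
    for bw in bandwidths:
        k_xx = gaussian_kernel(x, x, bw).mean()
        k_yy = gaussian_kernel(y, y, bw).mean()
        k_xy = gaussian_kernel(x, y, bw).mean()
        total += k_xx + k_yy - 2.0 * k_xy
    return total / len(bandwidths)

def alignment_loss(rec_loss, mmd_loss, gamma=0.1):
    """Overall alignment loss L = L_Rec + gamma * L_MMD."""
    return rec_loss + gamma * mmd_loss
```

In practice `x` would hold projected text embeddings and `y` the ID embeddings for the same mini-batch of items, and the loss would be computed in an autodiff framework rather than NumPy.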

Compute: Not reported in the paper

Comparison to Prior Work

vs. CTRL/RLMRec: Uses characteristic MK-MMD instead of InfoNCE; disentangles experts instead of simple fusion
vs. TALLRec: Uses embeddings for efficient retrieval rather than slow generative inference
vs. Universal-Multi-Modal-Rec [not cited in paper]: PAD specifically targets the forgetting problem in alignment via the anchored loss, whereas UMMR often focuses on architecture

Limitations

Reliance on pre-trained LLM quality; poor text descriptions may limit effectiveness
Frequency-based gating assumes popularity correlates perfectly with modality reliability, which might not always hold
Increases model parameter count due to multiple embedding tables and experts

Reproducibility

Code: https://github.com/Applied-Machine-Learning-Lab/PAD

Code and datasets are publicly available at https://github.com/Applied-Machine-Learning-Lab/PAD. Hyperparameters for reproduction are detailed in the appendix.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation on sparse datasets

Benchmarks:

Amazon Sports (Sequential Item Recommendation)
Amazon Beauty (Sequential Item Recommendation)
Amazon Toys (Sequential Item Recommendation)

Metrics:

Hit Ratio @ 10 (HR@10)
NDCG @ 10
Statistical methodology: Not explicitly reported in the paper
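
The two metrics can be computed per user as in the sketch below. It assumes a single held-out target item per user, a common protocol for sequential recommendation that this summary does not confirm; the function names are illustrative.

```python
import math

def hit_ratio_at_k(ranked_items, target, k=10):
    """1.0 if the target item appears in the top-k ranked list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k with exactly one relevant item.

    The ideal DCG is 1 in this case, so the score reduces to
    1 / log2(rank + 1) when the target is ranked within the top k.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    return 1.0 / math.log2(rank + 1)
```

Dataset-level HR@10 and NDCG@10 are then the averages of these per-user values.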

Key Results

Overall performance comparison shows PAD consistently outperforming baselines across all datasets.

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Sports	HR@10	0.0402	0.0428	+0.0026
Amazon Beauty	NDCG@10	0.0385	0.0456	+0.0071

Cold-start analysis demonstrates superior handling of unseen or rare items.

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty (Cold Items)	NDCG@10	0.0543	0.0632	+0.0089

Ablation study confirms the necessity of both alignment and disentanglement components.

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty	HR@10	0.0645	0.0699	+0.0054
Amazon Beauty	HR@10	0.0664	0.0699	+0.0035
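
The Δ column reports absolute differences. To express a row as a relative improvement, a small helper such as the following can be used; the rows are copied from the table, and the helper name is illustrative.

```python
def relative_gain_percent(baseline, ours):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

# (benchmark, metric, baseline, this paper), taken from the rows above
rows = [
    ("Amazon Sports", "HR@10", 0.0402, 0.0428),
    ("Amazon Beauty", "NDCG@10", 0.0385, 0.0456),
    ("Amazon Beauty (Cold Items)", "NDCG@10", 0.0543, 0.0632),
]

for name, metric, baseline, ours in rows:
    gain = relative_gain_percent(baseline, ours)
    print(f"{name} {metric}: {gain:.2f}% relative gain")
```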

Experiment Figures

Performance breakdown (NDCG@10) across item frequency groups (Cold to Hot items) on Beauty and Sports datasets.

Visualization of catastrophic forgetting: Performance of the ID backbone on ID-only tasks before and after different alignment methods.

Main Takeaways

Characteristic kernels (MK-MMD) capture data distribution better than cosine-based contrastive losses, leading to better alignment.
Rec-anchored alignment effectively prevents catastrophic forgetting; ID embeddings retain their collaborative signal after alignment.
Disentangled triple-experts allow the model to dynamically switch reliance between text (for cold items) and ID (for warm items) signals.
Consistent improvements across datasets with varying sparsity levels indicate robustness.

📚 Prerequisite Knowledge

Prerequisites

Sequential Recommendation (SASRec architecture)
Reproducing Kernel Hilbert Space (RKHS)
Maximum Mean Discrepancy (MMD)
Mixture of Experts (MoE)

Key Terms

SR: Sequential Recommendation—predicting the next item a user will interact with based on their history

RKHS: Reproducing Kernel Hilbert Space—a space of functions where evaluation is a continuous linear functional, allowing probability distributions to be embedded as points (mean embeddings)

MMD: Maximum Mean Discrepancy—a statistic that measures the distance between two probability distributions by comparing their mean embeddings in a kernel space; it is also used as the basis of two-sample tests

MK-MMD: Multi-Kernel Maximum Mean Discrepancy—an extension of MMD using a linear combination of multiple kernels to better capture different scales of data structure

Characteristic Kernel: A kernel (like Gaussian) whose mean embedding map is injective, ensuring that MMD=0 iff the two distributions are identical (capturing all statistical moments)

Catastrophic Forgetting: A phenomenon where a model forgets previously learned information (e.g., collaborative patterns) while learning a new task (e.g., semantic alignment)

MoE: Mixture of Experts—an architecture where different sub-models ('experts') specialize in different parts of the input space, activated by a gating network

SASRec: Self-Attentive Sequential Recommendation—a standard Transformer-based baseline model for sequential recommendation