Bridging the Information Gap Between Domain-Specific Model and General LLM for Personalized Recommendation

📝 Paper Summary

LLM-based Recommendation, Collaborative Filtering, Hybrid Recommendation Systems

BDLM aligns Large Language Models with domain-specific recommendation models via a shared embedding module and mutual learning, allowing the LLM to access community behavior patterns while enhancing the domain model with semantic knowledge.

Core Problem

General LLMs lack access to collaborative signals (community behavior patterns) critical for recommendation, while domain-specific models struggle with data sparsity due to a lack of general knowledge.

Why it matters:

LLMs struggle to distinguish similar items or capture latent community trends purely from text prompts
Domain-specific models (like Matrix Factorization) fail when interaction data is sparse because they cannot leverage semantic item content
Existing methods that translate interaction history into text prompts ('text-is-all-you-need') lose structural graph information

Concrete Example: In e-commerce, two shoes might have nearly identical text descriptions (e.g., 'Black Leather Shoes'), making them indistinguishable to an LLM. A domain model, however, can tell them apart because distinct user groups buy them. Conversely, a domain model fails on a new item with no clicks, whereas an LLM can infer its appeal from its description.

Key Novelty

Bridge Domain-specific and LLM models (BDLM)

Introduces task-specific tokens (<uid>, <iid>) into the LLM's vocabulary, initialized with embeddings from a domain-specific model (like LightGCN) to transfer behavioral patterns
Uses a deep mutual learning strategy where the LLM and domain model iteratively update a shared 'information sharing module' to align their representations in the same latent space
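A minimal NumPy sketch of the token-extension step, for illustration only. It assumes one new token per user and per item with the hypothetical names `<uid{k}>` / `<iid{k}>`, and equal embedding widths on both sides (the Key Hyperparameters list gives 4096 for both), so each new row can be copied directly from the domain model. The function name `extend_vocab` is hypothetical, not from the paper.

```python
import numpy as np

def extend_vocab(token_emb, vocab, user_emb, item_emb):
    """Append one <uid{k}> row per user and one <iid{k}> row per item to the
    LLM's input-embedding matrix, initialising each new row from the domain model.

    token_emb: (V, d) existing LLM input embeddings
    vocab:     dict mapping token string -> row index (0..V-1)
    user_emb:  (n_users, d) pre-trained domain-model user embeddings
    item_emb:  (n_items, d) pre-trained domain-model item embeddings
    """
    d = token_emb.shape[1]
    # Assumption: equal widths; a learned projection would be needed otherwise.
    if user_emb.shape[1] != d or item_emb.shape[1] != d:
        raise ValueError("domain and LLM embedding widths differ")
    vocab = dict(vocab)  # do not mutate the caller's mapping
    next_id = token_emb.shape[0]
    for prefix, table in (("uid", user_emb), ("iid", item_emb)):
        for k in range(table.shape[0]):
            vocab[f"<{prefix}{k}>"] = next_id
            next_id += 1
    # Row order matches the id assignment above: all users, then all items.
    extended = np.vstack([token_emb, user_emb, item_emb])
    return extended, vocab
```

With Hugging Face `transformers`, the analogous steps would be `tokenizer.add_tokens(...)`, then `model.resize_token_embeddings(len(tokenizer))`, then writing the domain vectors into the new rows of `model.get_input_embeddings().weight`.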

Evaluation Highlights

+22.7% relative HR@1 improvement on MovieLens-1M over the state-of-the-art domain model (LightGCN)
+29.8% relative HR@1 improvement on Amazon-Grocery over LightGCN
Substantially outperforms text-only LLM baselines (InstructRec) across all datasets, particularly in e-commerce domains, where product descriptions are less discriminative than movie titles

Breakthrough Assessment

7/10

Strong conceptual contribution in bridging the 'modality gap' between ID-based and text-based recommendation. The mutual learning approach is effective, though the architecture relies on standard components.

⚙️ Technical Details

Problem Definition

Setting: Personalized Top-K recommendation and Interaction Prediction

Inputs: User set U, Item set I, interaction matrix R, and text prompts C_ui containing user/item context

Outputs: Predicted interaction probability or top-K item list

Pipeline Flow

Domain Model Pre-training (LightGCN/NCF) → Initial Embeddings
LLM Token Extension (<uid>, <iid> initialized from Domain Model)
Joint Training Loop: LLM SFT Update ↔ Information Sharing Module Update ↔ Domain Model Update
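The summary does not give the exact update order or how the shared matrices are refreshed, so the following is only a toy sketch of the alternating loop under loudly stated assumptions: each model's task loss is replaced by a hypothetical quadratic pull toward a target (`T_llm`, `T_drs`), the alignment terms mirror L_m1 / L_m2, the shared module `M` moves toward whichever model was just updated, and `align_weight` is a hypothetical weight (not necessarily the paper's gamma).

```python
import numpy as np

def sq_grad(E, target):
    """Gradient of ||E - target||^2 with respect to E."""
    return 2.0 * (E - target)

def mutual_learning(E_llm, E_drs, T_llm, T_drs, align_weight=0.5, lr=0.05, rounds=200):
    """Toy schedule: LLM step -> shared-module step -> domain step -> shared-module step.

    E_llm, E_drs: embedding tables held by each model (same shape)
    T_llm, T_drs: hypothetical stand-ins for each model's task objective
    """
    E_llm = np.array(E_llm, dtype=float)  # np.array copies the inputs
    E_drs = np.array(E_drs, dtype=float)
    M = 0.5 * (E_llm + E_drs)  # shared module, initialised between both views (assumption)
    for _ in range(rounds):
        # 1) LLM update: task loss + align_weight * ||E_llm - M||^2 (L_m1 analogue)
        E_llm -= lr * (sq_grad(E_llm, T_llm) + align_weight * sq_grad(E_llm, M))
        # 2) Shared module moves toward the LLM's current view
        M -= lr * sq_grad(M, E_llm)
        # 3) Domain update: task loss + align_weight * ||E_drs - M||^2 (L_m2 analogue)
        E_drs -= lr * (sq_grad(E_drs, T_drs) + align_weight * sq_grad(E_drs, M))
        # 4) Shared module moves toward the domain model's current view
        M -= lr * sq_grad(M, E_drs)
    return E_llm, E_drs, M
```

Running it once with `align_weight=0.5` and once with `align_weight=0.0` from the same start shows the intended effect: with alignment, the two tables settle closer to each other than their independent task optima would put them.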

System Modules

Domain Information Enhanced LLM

Generate recommendations using both text and behavioral embeddings

Model or implementation: Vicuna-7B (SFT version of LLaMA-7B)

Common Knowledge Enhanced Domain Model

Predict interaction probability using structural and semantic signals

Model or implementation: LightGCN or NCF

Information Sharing Module

Store and synchronize embeddings between the two models

Model or implementation: Shared Parameter Storage (Matrices M_u, M_i)

Novel Architectural Elements

Information Sharing Module acting as a bridge for collaborative training parameters
Mixed embedding layer in LLM initialized from Graph Neural Network (LightGCN) weights

Modeling

Base Model: Vicuna-7B

Training Method: Joint SFT and Mutual Learning

Objective Functions:

Purpose: Optimize the LLM for generation.
Formally: L_llm = -log P(Answer | Prompt)

Purpose: Optimize the domain model for interaction prediction.
Formally: L_drs = CrossEntropy(y, y_hat)

Purpose: Align LLM embeddings with the shared module.
Formally: L_m1 = ||u_LLM - M_u||^2 + ||i_LLM - M_i||^2

Purpose: Align domain model embeddings with the shared module.
Formally: L_m2 = ||u_DRS - M_u||^2 + ||i_DRS - M_i||^2
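The four objectives can be evaluated on toy inputs as follows. This is a sketch: L_llm is written as the usual autoregressive sum of per-token negative log-probabilities, L_drs is assumed to be binary cross-entropy averaged over (user, item) pairs, and one helper covers both L_m1 and L_m2 since they share a form. The summary does not say how the four terms are weighted when combined, so no total loss is formed here.

```python
import numpy as np

def llm_loss(answer_token_probs):
    """L_llm = -log P(Answer | Prompt) = sum over answer tokens of -log p(a_t | prompt, a_<t)."""
    p = np.asarray(answer_token_probs, dtype=float)
    return float(-np.sum(np.log(p)))

def drs_loss(y, y_hat, eps=1e-12):
    """L_drs as binary cross-entropy (assumption: 0/1 interaction labels), averaged over pairs."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def align_loss(u, M_u, i, M_i):
    """||u - M_u||^2 + ||i - M_i||^2; pass LLM vectors for L_m1, domain vectors for L_m2."""
    u, M_u, i, M_i = (np.asarray(x, dtype=float) for x in (u, M_u, i, M_i))
    return float(np.sum((u - M_u) ** 2) + np.sum((i - M_i) ** 2))

print(llm_loss([0.5, 0.25]))                      # -log 0.5 - log 0.25 = log 8, about 2.079
print(drs_loss([1], [0.5]))                       # log 2, about 0.693
print(align_loss([1, 2], [1, 0], [0, 0], [3, 4])) # 4 + 25 = 29.0
```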

Adaptation: Full parameter SFT on LLM + Domain Model training

Trainable Parameters: All LLM parameters + Domain Model parameters + Added Token Embeddings

Key Hyperparameters:

learning_rate_LLM: 1e-5
learning_rate_Domain: 1e-4
batch_size: 16
gamma: Varies by dataset (1e-1 for MovieLens, 1e-3 for Grocery, 1 for Health)
LLM_embedding_dim: 4096
Domain_embedding_dim: 4096

Compute: 2 NVIDIA A800 GPUs

Comparison to Prior Work

vs. InstructRec: BDLM uses ID embeddings to capture graph signals that text cannot express
vs. GPT4SM: BDLM uses a bidirectional loop (Mutual Learning) rather than a one-way feature transfer
vs. TALLRec [not cited in paper]: BDLM aligns a separate domain model rather than only tuning the LLM on recommendation data

Limitations

Requires adding one token per user and per item to the LLM vocabulary, so the embedding table grows linearly with |U| + |I| (creating scaling issues for very large user/item sets)
Computationally expensive due to joint training of a 7B model and a GCN
Evaluation is limited to multi-shot scenarios; cold-start performance is mentioned as a goal but not explicitly isolated in results
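To make the vocabulary-growth limitation concrete: each added token carries one embedding row of width d, so the added input-embedding parameters are (|U| + |I|) × d. The catalogue size in the example is hypothetical, not from the paper, and `output_head_too` is a hypothetical switch for the case where the new tokens must also be generated through an untied output projection.

```python
def added_embedding_params(n_users, n_items, dim=4096, output_head_too=False):
    """Parameters added by one new token per user and per item."""
    rows = n_users + n_items
    # An untied output head would need one extra row per new token as well.
    return rows * dim * (2 if output_head_too else 1)

def gigabytes(n_params, bytes_per_param=2):
    """Weight memory only (2 bytes per parameter for fp16/bf16); no gradients or optimizer state."""
    return n_params * bytes_per_param / 1e9

n = added_embedding_params(1_000_000, 100_000)  # hypothetical: 1M users, 100k items
print(n, gigabytes(n))  # 4505600000 parameters, about 9.0 GB of bf16 weights
```

At this hypothetical scale the added rows alone (about 4.5 billion parameters) are on the order of the 7B backbone itself, before counting gradients or optimizer state for full-parameter training.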

Reproducibility

The paper does not provide a code link. Datasets (MovieLens-1M, Amazon) are public. Hyperparameters are listed, but the prompts are only partially shown in figures.

📊 Experiments & Results

Evaluation Setup

Top-K Recommendation (ranking 20 candidates) and Interaction Prediction (binary classification)
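A small sketch of how HR@K is read off a 20-candidate ranking. It assumes each test case is a candidate list containing exactly one held-out target plus other candidates, and that the model yields a score per candidate; the summary does not say how the negatives are sampled, and the LLM side may select items by generation rather than scoring. Ties are broken pessimistically here (tied candidates count as ranked above the target), which is a choice of this sketch.

```python
import numpy as np

def rank_of_target(scores, target_idx):
    """1-based rank of the target among its candidates (higher score = better).
    Counts every candidate scoring >= the target, including the target itself."""
    scores = np.asarray(scores, dtype=float)
    return int(np.sum(scores >= scores[target_idx]))

def hit_rate_at_k(ranks, k):
    """HR@K: fraction of test cases whose target lands in the top K."""
    ranks = np.asarray(ranks)
    return float(np.mean(ranks <= k))

print(rank_of_target([0.1, 0.9, 0.4], target_idx=2))  # 2
print(hit_rate_at_k([1, 2, 5, 1], k=1))               # 0.5
print(hit_rate_at_k([1, 2, 5, 1], k=2))               # 0.75
```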

Benchmarks:

MovieLens-1M (Movie Recommendation)
Amazon-Grocery (E-commerce Recommendation)
Amazon-Health (E-commerce Recommendation)

Metrics:

Hit Rate @ 1 (HR@1)
Hit Rate @ 2 (HR@2)
Precision
Recall
F1 Score
Statistical methodology: Not explicitly reported in the paper
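For the interaction-prediction task, precision, recall, and F1 follow from the confusion counts of binary predictions. This sketch assumes the positive class is "interaction" (label 1) and computes the metrics for that class only; the summary does not state whether the paper averages them in any other way.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (interaction = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# tp = 2, fp = 1, fn = 1, so precision = recall = F1 = 2/3
print(precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
```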

Key Results

BDLM outperforms both traditional domain models and LLM-based baselines on Top-K recommendation tasks across all datasets.

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens-1M	HR@1	0.372	0.460	+0.088
Amazon-Grocery	HR@1	0.339	0.440	+0.101
Amazon-Health	HR@1	0.344	0.403	+0.059

Ablation studies show that initialization from domain models and joint learning are both critical.

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens-1M	HR@1	0.360	0.431	+0.071

Experiment Figures

Ablation study bar charts across three datasets

Main Takeaways

Text-only LLMs (InstructRec) perform well on movies but poorly on e-commerce (Grocery/Health) because product descriptions are less discriminative than movie titles; BDLM fixes this via ID embeddings.
The domain-specific model (BDLM-drs) generally outperforms the LLM side (BDLM-llm) at inference time, benefiting from the semantic knowledge injected during joint training.
Mutual learning provides significant gains over static pre-loading of embeddings, indicating that the dynamic alignment loop is effective.
Generalization: BDLM works with different backbone domain models (LightGCN and NCF), consistently improving over the base versions.

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering (Matrix Factorization, GCNs)
Large Language Models (Transformer architecture, SFT)
Embedding learning and alignment

Key Terms

SFT: Supervised Fine-Tuning, i.e., further training a pre-trained model on labeled examples for a specific task

Collaborative Filtering: Recommendation approach that predicts user preference based on the patterns of other similar users

NCF: Neural Collaborative Filtering—a deep learning framework that replaces the inner product of matrix factorization with a neural architecture

LightGCN: Light Graph Convolutional Network—a simplified GCN for recommendation that learns embeddings by propagating them on the user-item interaction graph

BDLM: Bridge Domain-specific and LLM models—the authors' proposed framework

Deep Mutual Learning: A training strategy where two models learn collaboratively by teaching each other, typically by penalizing disagreement between their predictions; BDLM applies the idea to embeddings, penalizing each model's distance to a shared module

Task-specific tokens: Special tokens added to the LLM vocabulary (e.g., <uid1>) to represent specific users or items, distinct from natural language words

HR@K: Hit Rate at K—the proportion of test cases where the target item appears in the top K recommendations

ZeRO-3 strategy: A memory optimization technique for training large models (stage 3 of DeepSpeed's ZeRO) that partitions optimizer states, gradients, and model parameters across GPUs
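Since the LLM's user/item tokens are seeded from LightGCN, a compact sketch of the standard LightGCN propagation may help: no feature transforms or nonlinearities, symmetric degree normalization on the user-item bipartite graph, and a uniform average over layers 0..K (the 1/(K+1) weighting of the original LightGCN paper). The layer count BDLM uses is not given in the summary; `num_layers=3` is only an illustrative default, and training (e.g., the loss used to learn the layer-0 embeddings) is omitted.

```python
import numpy as np

def lightgcn_propagate(R, E0, num_layers=3):
    """R:  (n_users, n_items) 0/1 interaction matrix
    E0: (n_users + n_items, d) layer-0 embeddings, users first, then items
    Returns final (user_emb, item_emb) as the mean over layers 0..num_layers."""
    n_u, n_i = R.shape
    # Bipartite adjacency over users and items.
    A = np.zeros((n_u + n_i, n_u + n_i))
    A[:n_u, n_u:] = R
    A[n_u:, :n_u] = R.T
    # Symmetric normalisation D^{-1/2} A D^{-1/2}; isolated nodes get zero rows.
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    inv_sqrt[nz] = deg[nz] ** -0.5
    A_hat = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    layers = [np.asarray(E0, dtype=float)]
    for _ in range(num_layers):
        layers.append(A_hat @ layers[-1])  # pure neighbourhood aggregation
    E = np.mean(layers, axis=0)            # uniform layer weights 1/(K+1)
    return E[:n_u], E[n_u:]
```

The returned user and item rows are the kind of vectors that would then be copied into the `<uid>` / `<iid>` token embeddings.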