Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations

📝 Paper Summary

LLM-based Recommender Systems
Retrieval-Augmented Generation (RAG)

DOKE augments frozen LLMs with domain-specific knowledge (item attributes and collaborative filtering signals) via prompt engineering, enabling high-performance recommendation without fine-tuning.

Core Problem

General-purpose LLMs lack two critical types of domain knowledge required for accurate recommendations: the full, evolving dataset of items and specific working patterns (like collaborative filtering signals) inherent in user interaction logs.

Why it matters:

Fine-tuning massive LLMs on domain data is computationally expensive and prone to overfitting, potentially sacrificing general intelligence
LLMs hallucinate on less popular or fresh items not present in their pre-training corpus
Purely semantic-based recommendations by LLMs miss behavioral patterns (collaborative signals) found in interaction data

Concrete Example: In a movie recommendation task, an LLM might recommend 'Good Will Hunting' to a user who watched 'Rain Man' because both are dramas (semantic similarity). However, domain interaction data shows users actually co-click 'Field of Dreams' (collaborative signal), a pattern the LLM misses without external knowledge.

Key Novelty

DOmain-specific KnowledgE (DOKE) Paradigm

Treats domain knowledge (attributes and interaction patterns) as 'plugins' injected into prompts rather than weights to be learned via fine-tuning
Extracts collaborative filtering signals (Item-to-Item and User-to-Item relevance) from interaction logs using lightweight external models
Translates numerical relevance scores into LLM-understandable formats: natural language templates or reasoning paths on a knowledge graph

Architecture

The workflow of the DOKE paradigm instantiated for Recommender Systems.

Evaluation Highlights

Significantly outperforms zero-shot LLM baselines (e.g., +84.3% NDCG@1 on ML-1M for ChatGPT) by incorporating customized domain knowledge
Achieves performance comparable to fully trained traditional models (e.g., SASRec) and fine-tuned LLMs (Llama-2-7b) without updating parameters
Customized knowledge (history-candidate relevance) yields higher gains than global knowledge, improving NDCG@10 by +40.5% on ML-1M over standard prompts

Breakthrough Assessment

7/10

Strong practical contribution demonstrating that cheap prompt augmentation with CF signals competes with expensive fine-tuning. While methodologically simple, it effectively bridges the gap between semantic LLM reasoning and behavioral recommendation data.

⚙️ Technical Details

Problem Definition

Setting: Re-ranking a candidate set of items for a user based on interaction history

Inputs: User interaction history H_u, a set of candidate items C retrieved by an off-the-shelf method

Outputs: A ranked list of the candidate items C

Pipeline Flow

Knowledge Preparation (Extract attributes and CF signals from data)
Knowledge Customization (Filter knowledge relevant to current user/candidates)
Knowledge Expression (Convert to text/KG paths)
LLM Inference (Generate ranking via prompting)
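The four stages above can be sketched as one function that chains pluggable callables. This is a minimal illustration, not the paper's implementation: `run_doke`, the stage signatures, the prompt wording, and the toy data are all assumptions, and the "LLM" is a stand-in lambda so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple

Fact = Tuple[str, str, float]  # (history item, candidate item, relevance score)
Scores = Dict[str, Dict[str, float]]


def run_doke(
    history: List[str],
    candidates: List[str],
    prepare: Callable[[], Scores],
    customize: Callable[[Scores, List[str], List[str]], List[Fact]],
    express: Callable[[List[Fact]], str],
    rank: Callable[[str], List[str]],
) -> List[str]:
    """Hypothetical skeleton: each argument is one pipeline stage."""
    knowledge = prepare()                               # 1. Knowledge Preparation
    facts = customize(knowledge, history, candidates)   # 2. Knowledge Customization
    context = express(facts)                            # 3. Knowledge Expression
    prompt = (
        f"{context}\n"
        f"Watched: {', '.join(history)}\n"
        f"Candidates: {', '.join(candidates)}\n"
        "Rank the candidates from most to least likely to be watched next."
    )
    return rank(prompt)                                 # 4. LLM Inference


# Toy stages (all invented for illustration).
toy_scores = {"Rain Man": {"Field of Dreams": 0.67}}
ranking = run_doke(
    history=["Rain Man"],
    candidates=["Good Will Hunting", "Field of Dreams"],
    prepare=lambda: toy_scores,
    customize=lambda k, h, c: [(i, j, k[i][j]) for i in h for j in c if j in k.get(i, {})],
    express=lambda facts: "\n".join(f"'{i}' is often co-clicked with '{j}'." for i, j, _ in facts),
    # Stand-in for the frozen LLM: promotes a candidate named in the injected knowledge.
    rank=lambda prompt: sorted(
        ["Good Will Hunting", "Field of Dreams"],
        key=lambda c: f"co-clicked with '{c}'" not in prompt,
    ),
)
print(ranking)
```

Keeping each stage behind a callable mirrors the decoupling the summary describes: the extractor can be swapped (co-occurrence, SASRec) without touching the LLM call.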

System Modules

Domain Knowledge Extractor (Input Processing)

Mines item attributes and calculates relevance scores (Item-2-Item, User-2-Item) using external models like SASRec or simple co-occurrence

Model or implementation: Lightweight external models (e.g., statistical co-occurrence, pre-trained SASRec)
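As a rough sketch of the simplest extractor mentioned above (statistical co-occurrence), the snippet below estimates an item-to-item score as the share of users who interacted with item i and also interacted with item j. The function name `cooccurrence_scores`, this particular normalization, and the toy histories are assumptions made for illustration; the summary does not specify the exact statistic.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_scores(user_histories):
    """Return scores[i][j] = (#users with both i and j) / (#users with i)."""
    item_users = Counter()   # number of users who interacted with each item
    pair_users = Counter()   # number of users who interacted with both items of a pair
    for history in user_histories:
        items = sorted(set(history))          # count each item once per user
        item_users.update(items)
        for a, b in combinations(items, 2):
            pair_users[(a, b)] += 1

    scores = defaultdict(dict)
    for (a, b), n in pair_users.items():
        scores[a][b] = n / item_users[a]
        scores[b][a] = n / item_users[b]
    return dict(scores)


# Toy interaction logs (invented).
histories = [
    ["Rain Man", "Field of Dreams"],
    ["Rain Man", "Field of Dreams", "Good Will Hunting"],
    ["Rain Man", "Top Gun"],
]
i2i = cooccurrence_scores(histories)
print(i2i["Rain Man"])
```

The score is asymmetric by design: in the toy data every "Field of Dreams" viewer also watched "Rain Man", but only two of three "Rain Man" viewers watched "Field of Dreams".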

Knowledge Customizer

Selects the most relevant knowledge pieces for the specific user history and candidate set

Model or implementation: Heuristic selection
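One plausible heuristic, shown here only as an assumption, is to keep the top-k history-candidate pairs by relevance score so the prompt stays short. `customize_knowledge`, the `top_k` parameter, and the toy scores are hypothetical names and values.

```python
def customize_knowledge(history, candidates, i2i, top_k=3):
    """Keep the top_k (history item, candidate, score) facts with the highest score."""
    pairs = []
    for h in history:
        for c in candidates:
            score = i2i.get(h, {}).get(c, 0.0)
            if score > 0:
                pairs.append((h, c, score))
    # Highest score first; ties broken alphabetically for deterministic prompts.
    pairs.sort(key=lambda p: (-p[2], p[0], p[1]))
    return pairs[:top_k]


# Toy item-to-item scores (invented).
i2i = {
    "Rain Man": {"Field of Dreams": 0.67, "Top Gun": 0.33},
    "Heat": {"Top Gun": 0.5},
}
facts = customize_knowledge(
    history=["Rain Man", "Heat"],
    candidates=["Field of Dreams", "Top Gun", "Alien"],
    i2i=i2i,
    top_k=2,
)
print(facts)
```

Restricting the facts to pairs that involve the current candidates corresponds to the "History & Candidate based" customization described in this summary.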

Knowledge Expresser (Input Processing)

Translates structured knowledge into natural language for the LLM

Model or implementation: Template-based or Graph-based formatter
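A template-based formatter could look like the sketch below. The sentence template and the function name `express_as_text` are assumptions; the paper's actual wording may differ.

```python
def express_as_text(facts):
    """Turn (history item, candidate, score) facts into one sentence each."""
    template = (
        "Users who interacted with '{hist}' often also interacted with "
        "'{cand}' (relevance {score:.2f})."
    )
    return "\n".join(
        template.format(hist=h, cand=c, score=s) for h, c, s in facts
    )


# Toy facts (invented).
facts = [("Rain Man", "Field of Dreams", 0.67), ("Heat", "Top Gun", 0.5)]
knowledge_text = express_as_text(facts)
print(knowledge_text)
```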

LLM Ranker

Ranks the candidate items based on the augmented prompt

Model or implementation: ChatGPT / Davinci / Llama-2-7b
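The ranking step amounts to assembling an augmented prompt and mapping the model's free-text reply back to the candidate set. The sketch below makes no API call; `build_prompt`, `parse_ranking`, the prompt wording, and the sample reply are all hypothetical.

```python
import re


def build_prompt(history, candidates, knowledge_text):
    """Assemble the augmented prompt (wording is an assumption)."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, 1))
    return (
        "You are a recommender system.\n"
        f"The user has interacted with: {', '.join(history)}.\n"
        f"Domain knowledge:\n{knowledge_text}\n"
        f"Candidates:\n{numbered}\n"
        "Return all candidates as a numbered list, most relevant first."
    )


def parse_ranking(reply, candidates):
    """Map a numbered reply back to candidates; drop unknown or repeated titles."""
    lookup = {c.lower(): c for c in candidates}
    ranked = []
    for line in reply.splitlines():
        # Strip a leading "1." or "1)" and surrounding quotes before matching.
        title = re.sub(r"^\s*\d+[.)]\s*", "", line).strip().strip("'\"").lower()
        item = lookup.get(title)
        if item is not None and item not in ranked:
            ranked.append(item)
    # Candidates the model omitted keep their original order at the end.
    ranked.extend(c for c in candidates if c not in ranked)
    return ranked


candidates = ["Top Gun", "Field of Dreams", "Alien"]
prompt = build_prompt(["Rain Man"], candidates, "'Rain Man' is often co-clicked with 'Field of Dreams'.")
reply = "1. Field of Dreams\n2. Unknown Movie\n3. top gun"   # invented model reply
print(parse_ranking(reply, candidates))
```

Filtering the reply against the candidate list guards against hallucinated titles, and appending omitted candidates keeps the output a full permutation so ranking metrics remain well defined.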

Novel Architectural Elements

Decoupled Knowledge Extraction: Separating the 'reasoning' (LLM) from the 'domain memory' (External Extractor) to avoid fine-tuning
KG-based Rationale Injection: Using reasoning paths from a Knowledge Graph explicitly as prompt context to explain collaborative filtering signals
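To make the KG-based rationale concrete, the following sketch finds a shortest reasoning path between two items with breadth-first search over (head, relation, tail) triples. The function `find_reasoning_path`, the "inverse of" naming for reversed edges, and the `max_hops` limit are illustrative assumptions, not details confirmed by the paper.

```python
from collections import deque


def find_reasoning_path(triples, source, target, max_hops=3):
    """Return a shortest path 'entity -> relation -> entity ...' or None."""
    graph = {}
    for head, relation, tail in triples:
        graph.setdefault(head, []).append((relation, tail))
        # Add the reverse edge so paths can pass through shared attributes.
        graph.setdefault(tail, []).append((f"inverse of {relation}", head))

    queue = deque([(source, [source])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return " -> ".join(path)
        hops = (len(path) - 1) // 2   # path alternates entity, relation, entity, ...
        if hops >= max_hops:
            continue
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [relation, neighbor]))
    return None


# Toy knowledge graph (invented).
triples = [
    ("Movie A", "has genre", "Action"),
    ("Movie B", "has genre", "Action"),
]
print(find_reasoning_path(triples, "Movie A", "Movie B"))
```

The returned string can be placed in the prompt as an explanation for why a history item and a candidate are related.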

Modeling

Base Model: ChatGPT (gpt-3.5-turbo), Text-Davinci-003, Llama-2-7b

Training Method: In-context learning via Prompt Engineering

Compute: No training compute for DOKE. Inference latency is the LLM API call (or local inference) time plus the overhead of the lightweight extractor.

Comparison to Prior Work

vs. P5: DOKE avoids fine-tuning entirely, utilizing external extractors to inject knowledge via prompts
vs. TALLRec: DOKE relies on prompt augmentation rather than parameter adaptation (LoRA/tuning)
vs. ChatGPT-Recency/ICL: DOKE injects explicit CF signals (co-occurrence/model-based relevance) rather than just formatting history/examples [not cited in paper]

Limitations

Dependency on external models (SASRec/MF) to extract high-quality domain signals
Prompt length constraints limit the amount of knowledge that can be injected
Inference cost of calling LLMs for ranking is significantly higher than traditional dot-product retrieval
Effectiveness relies on the quality of the underlying domain data and knowledge graph availability

Reproducibility

Code: http://github.com/microsoft/DOKE

The code is publicly available (http://github.com/microsoft/DOKE). Experiments use standard datasets (MovieLens-1M, Amazon Beauty, Online Retail), and baselines use standard implementations (RecBole, etc.). The main results rely on the OpenAI API.

📊 Experiments & Results

Evaluation Setup

Re-ranking 20 candidate items (1 positive, 19 negatives) based on user history. Interactions sorted by timestamp.
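One way to build such an evaluation sample is sketched below. Two points are assumptions not stated in this summary: the most recent interaction is held out as the positive, and the 19 negatives are drawn uniformly at random from items the user has not interacted with. `build_eval_sample` and the toy data are hypothetical.

```python
import random


def build_eval_sample(interactions, all_items, num_negatives=19, seed=0):
    """interactions: list of (item, timestamp) pairs for one user."""
    ordered = [item for item, _ in sorted(interactions, key=lambda x: x[1])]
    history, positive = ordered[:-1], ordered[-1]   # assumed leave-one-out split

    seen = set(ordered)
    pool = sorted(i for i in all_items if i not in seen)
    rng = random.Random(seed)
    negatives = rng.sample(pool, num_negatives)     # assumed uniform sampling

    candidates = negatives + [positive]
    rng.shuffle(candidates)                         # hide the positive's position
    return history, candidates, positive


# Toy data (invented): 100 item ids, one user with three timestamped interactions.
all_items = range(100)
interactions = [(7, 300), (3, 100), (5, 200)]
history, candidates, positive = build_eval_sample(interactions, all_items)
print(history, positive, len(candidates))
```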

Benchmarks:

MovieLens-1M (Movie Recommendation)
Amazon Beauty (Product Recommendation)
Online Retail (E-commerce Transaction Prediction)

Metrics:

NDCG@1, NDCG@5, NDCG@10
Hit Ratio (HR) @ 10
Statistical methodology: t-test at p<0.05 level reported for significance against baselines
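With exactly one positive among the candidates, the ideal DCG is 1, so NDCG@k reduces to 1 / log2(rank + 1) when the positive is ranked within the top k and 0 otherwise. The sketch below computes both metrics for a single sample under that single-positive assumption; the function names are illustrative.

```python
import math


def ndcg_at_k(ranked, positive, k):
    """NDCG@k for a single relevant item (ideal DCG equals 1)."""
    top = ranked[:k]
    if positive not in top:
        return 0.0
    rank = top.index(positive) + 1      # 1-based position of the positive
    return 1.0 / math.log2(rank + 1)


def hit_ratio_at_k(ranked, positive, k):
    """1.0 if the positive appears in the top k, else 0.0."""
    return 1.0 if positive in ranked[:k] else 0.0


ranked = ["Top Gun", "Alien", "Field of Dreams", "Heat"]   # toy ranking (invented)
print(ndcg_at_k(ranked, "Field of Dreams", k=5))
print(hit_ratio_at_k(ranked, "Field of Dreams", k=1))
```

Dataset-level scores are the mean of these per-sample values over all evaluated users.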

Key Results

DOKE significantly improves zero-shot LLM performance, bridging the gap with fully trained models:
Benchmark	Metric	Baseline	This Paper	Δ
MovieLens-1M	NDCG@10	0.5296	0.7451	+0.2155
Amazon Beauty	NDCG@10	0.3628	0.5145	+0.1517
Online Retail	NDCG@10	0.4150	0.7216	+0.3066

Comparison against fully trained baselines shows DOKE is competitive:
Benchmark	Metric	Baseline	This Paper	Δ
MovieLens-1M	NDCG@10	0.7495	0.7451	-0.0044

Ablation study demonstrates the impact of different knowledge types:
Benchmark	Metric	Baseline	This Paper	Δ
MovieLens-1M	NDCG@1	0.2750	0.5067	+0.2317
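The Δ column reports absolute differences, while the highlights quote relative gains. As a small worked example using only the numbers in the table, the snippet below converts each zero-shot and ablation row to a relative improvement; the NDCG@1 row corresponds to the +84.3% figure quoted in the evaluation highlights. The helper name `relative_gain` is an assumption.

```python
def relative_gain(baseline, ours):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (ours - baseline) / baseline


# (benchmark, metric, baseline, this paper), copied from the table above.
rows = [
    ("MovieLens-1M", "NDCG@10", 0.5296, 0.7451),
    ("Amazon Beauty", "NDCG@10", 0.3628, 0.5145),
    ("Online Retail", "NDCG@10", 0.4150, 0.7216),
    ("MovieLens-1M", "NDCG@1", 0.2750, 0.5067),
]
gains = {(b, m): round(relative_gain(base, ours), 1) for b, m, base, ours in rows}
for (benchmark, metric), gain in gains.items():
    print(f"{benchmark} {metric}: +{gain}%")
```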

Experiment Figures

Group analysis of performance based on item popularity and user history length.

Main Takeaways

Domain knowledge (CF signals) is the primary driver of performance; item attributes provide minor gains compared to behavioral patterns.
Customizing knowledge to the specific candidate set (History & Candidate based) is far more effective than providing global or history-only knowledge.
Reasoning paths (KG explanations) provide slight improvements over simple text templates for expressing collaborative filtering signals.
DOKE is robust to prompt variations, with improvements stemming from the information content rather than specific wording.
Improvements are most pronounced on samples where traditional baselines (used to extract the knowledge) also perform well (popular items, long history).

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering (CF) concepts
Matrix Factorization (MF)
Prompt Engineering (In-Context Learning)
Knowledge Graph reasoning paths

Key Terms

CF: Collaborative Filtering—making predictions about a user's interests by collecting preferences from many users

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items

HR: Hit Ratio—the fraction of users for whom the correct item is included in the recommendation list

SASRec: Self-Attentive Sequential Recommendation—a sequential model using attention mechanisms to predict next items

Zero-shot: Evaluating a model on a task without any specific gradient-based training on that task's data

Co-click probability: The likelihood that two items are clicked by the same user, serving as a strong signal of item similarity in behavioral data

Reasoning Path: A sequence of entities and relations in a Knowledge Graph connecting two items (e.g., Movie A -> has genre -> Action -> genre of -> Movie B)

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained model weights and injects trainable rank decomposition matrices