RecExplainer: Aligning Large Language Models for Explaining Recommendation Models

📝 Paper Summary

Explainable Recommendation, LLM-based Recommendation

RecExplainer fine-tunes large language models to act as surrogate explainers by aligning them with target recommendation models through behavior mimicry and latent space comprehension.

Core Problem

Embedding-based recommender systems are effective but operate as black boxes, lacking transparency and interpretability for users and developers.

Why it matters:

Traditional surrogate models (like decision trees) sacrifice fidelity for interpretability, while complex models lack human-readable explanations
Existing explanations are often limited to simple weights or rules, missing semantic, human-readable reasoning

Concrete Example: A traditional recommender suggests a movie based on latent vectors, but cannot explain *why*. RecExplainer allows an LLM to output: 'I recommend this movie because it aligns with your interest in Sci-Fi shown by your history of watching Star Wars.'

Key Novelty

RecExplainer (Three Alignment Strategies)

Behavior Alignment: Fine-tunes the LLM to predict the target model's output (items) given textual user history, mimicking the target's external behavior
Intention Alignment: Injects the target model's internal user/item embeddings directly into the LLM's input space, treating embeddings as a new language modality
Hybrid Alignment: Combines both textual history and latent embeddings to reduce hallucination and improve fidelity

Architecture

The model architecture for Intention Alignment. It illustrates how user/item embeddings from the frozen target model are projected via a linear layer into the LLM's input space.

Evaluation Highlights

Hybrid alignment achieves substantially higher alignment accuracy (Acc@1) than standard LLM prompting (e.g., 29.8% vs. ~3% on Amazon Beauty with SASRec as the target)
Explanations generated by the aligned LLM are rated higher by humans and GPT-4 for helpfulness and clarity compared to baselines
Intention alignment proves more effective than behavior alignment for pure recommendation tasks, but hybrid alignment balances generation quality best

Breakthrough Assessment

7/10

Novel application of LLMs as surrogate models using embedding alignment. Strong empirical results on alignment, though primarily an integration of existing techniques (LLM tuning + projection layers).

⚙️ Technical Details

Problem Definition

Setting: Given a trained target recommender model f, tune an LLM g to serve as a surrogate explainer that mimics f's behavior and explains its decisions.

Inputs: User history sequence x_u (item titles or embeddings), candidate items

Outputs: Predicted items (mimicking f), explanations for recommendations, or interest classifications

Pipeline Flow

Input Processing (Text or Embeddings)
Projection (if Intention/Hybrid)
LLM Inference

System Modules

Projector

Maps recommender embeddings (e.g., dim 32) to LLM input dimension (e.g., dim 4096)

Model or implementation: Linear Layer (MLP)
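
The projection step can be illustrated with a minimal sketch. It assumes a single affine map from the example recommender dimension (32) to the example LLM input dimension (4096); the actual projector may contain more layers, and the helper names `make_linear` and `project` are hypothetical.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Return a randomly initialised weight matrix (out_dim x in_dim) and a zero bias."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(embedding, weight, bias):
    """Apply y = W x + b to one recommender embedding."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weight, bias)]

REC_DIM, LLM_DIM = 32, 4096          # example sizes quoted in the summary
weight, bias = make_linear(REC_DIM, LLM_DIM)
item_embedding = [0.1] * REC_DIM     # stand-in for a frozen recommender embedding
soft_token = project(item_embedding, weight, bias)
print(len(soft_token))               # 4096
```

The projected vector has the same width as an LLM token embedding, so it can be placed in the input sequence alongside ordinary token embeddings.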

Surrogate LLM

Generates explanations or predictions mimicking the target model

Model or implementation: Llama-2-7b / Vicuna-7b (fine-tuned)

Novel Architectural Elements

Use of alignment tasks (Next item retrieval, Ranking, Discrimination) specifically to align LLM with a fixed external recommender's decision boundary
Hybrid prompt structure combining explicit item titles and implicit recommender embeddings
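
One way to picture the three prompt variants is as a sequence of text segments and embedding slots. The sketch below is illustrative only: the prompt wording, the segment representation, and the function name `build_prompt` are assumptions, not the paper's templates.

```python
def build_prompt(history, mode):
    """Build a list of ("text", str) and ("embed", vector) segments.

    history: list of (title, embedding) pairs from the user's interaction log.
    mode: "behavior" (titles only), "intention" (embeddings only),
          or "hybrid" (title followed by its embedding).
    """
    if mode not in ("behavior", "intention", "hybrid"):
        raise ValueError(f"unknown mode: {mode}")
    segments = [("text", "The user has interacted with: ")]
    for i, (title, emb) in enumerate(history):
        if i > 0:
            segments.append(("text", ", "))
        if mode in ("behavior", "hybrid"):
            segments.append(("text", title))
        if mode in ("intention", "hybrid"):
            # Slot to be filled with the projected recommender embedding.
            segments.append(("embed", emb))
    segments.append(("text", ". Which item would the recommender suggest next?"))
    return segments

history = [("Star Wars", [0.1, 0.2]), ("Blade Runner", [0.3, 0.4])]
for mode in ("behavior", "intention", "hybrid"):
    kinds = [kind for kind, _ in build_prompt(history, mode)]
    print(mode, kinds.count("text"), kinds.count("embed"))
```

Text segments would be tokenized and embedded by the LLM as usual, while embedding slots would be replaced by projector outputs before the sequence is fed to the model.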

Modeling

Base Model: Llama-2-7b-chat, Vicuna-7b-v1.3

Training Method: Supervised Fine-Tuning (SFT) with LoRA

Objective Functions:

Purpose: Minimize the difference between LLM generation and target outputs.

Formally: Standard Cross-Entropy Loss on next-token prediction.
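
For concreteness, next-token cross-entropy is the mean negative log-probability assigned to each target token. The sketch below computes it from raw logits using a numerically stable log-sum-exp; the function name and toy inputs are illustrative, and whether the loss is restricted to response tokens is not stated in this summary.

```python
import math

def next_token_cross_entropy(logits_per_step, target_ids):
    """Mean negative log-likelihood of the target tokens under softmax(logits)."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[target]
    return total / len(target_ids)

# Toy example: two decoding steps over a three-token vocabulary.
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
loss = next_token_cross_entropy(logits, targets)
print(round(loss, 4))
```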

Adaptation: LoRA (Low-Rank Adaptation)
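
As a reminder of what LoRA changes, the frozen weight W0 is augmented with a trainable low-rank update, giving W = W0 + (alpha / r) * B A. The sketch below shows this on tiny matrices; the rank and alpha values are placeholders, since the summary does not report the ones used.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(w0, a, b, alpha, rank):
    """Return W0 + (alpha / rank) * (B @ A).

    w0: frozen weight, shape (d_out, d_in)
    b:  trainable, shape (d_out, rank), commonly initialised to zeros
    a:  trainable, shape (rank, d_in)
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

w0 = [[1.0, 0.0], [0.0, 1.0]]
a = [[3.0, 4.0]]        # rank 1, d_in = 2
b = [[1.0], [2.0]]      # d_out = 2, rank 1
print(lora_effective_weight(w0, a, b, alpha=2, rank=1))
```

Only A and B receive gradients, which keeps the number of trainable parameters small relative to full fine-tuning.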

Training Data:

Tasks 1-5 generated from recommender datasets (Amazon, Yelp, Steam)
Target labels derived from the target recommender model f (not ground truth)
Task 6 (History Reconstruction) uses GPT-4 generated summaries
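
The point that labels come from the target model rather than ground truth can be sketched as follows. The example assumes dot-product scoring between a user vector and item vectors (as in MF or a two-tower model); the function names, item ids, and vectors are hypothetical.

```python
def score(user_emb, item_emb):
    """Dot-product relevance score under the target recommender."""
    return sum(u * v for u, v in zip(user_emb, item_emb))

def target_top_k(user_emb, item_embs, k=1, exclude=()):
    """Return the ids of the k items the target model scores highest.

    These ids, not the user's true next item, become the training labels.
    """
    candidates = [i for i in item_embs if i not in exclude]
    ranked = sorted(candidates,
                    key=lambda i: score(user_emb, item_embs[i]),
                    reverse=True)
    return ranked[:k]

items = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.7, 0.7]}
user = [0.9, 0.5]
label = target_top_k(user, items, k=1)
print(label)
```

For a sequential target such as SASRec, the user vector would come from the model's sequence encoder instead of a lookup table, but the labeling logic stays the same.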

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. P5/TallRec: RecExplainer aligns with a *target model's* predictions, not the ground truth data. The goal is explanation of a specific system, not just recommendation.
vs. ChatGPT (Zero-shot): RecExplainer uses internal embeddings (Intention Alignment) to ground the LLM in the specific recommender's latent space.
vs. LIME/SHAP [not cited in paper]: RecExplainer generates natural language explanations via an LLM surrogate, whereas LIME/SHAP provide feature importance weights.

Limitations

Dependency on the quality of the target recommender model; if the target is poor, the explanation might be faithful but not useful.
Hybrid alignment increases input complexity and token usage compared to behavior alignment.
Requires re-training/fine-tuning the LLM (or adapter) for each specific target recommender model.

Reproducibility

Code: https://github.com/microsoft/RecAI

Datasets (Amazon, Yelp, Steam) are public. Target models (MF, SASRec) are standard baselines.

📊 Experiments & Results

Evaluation Setup

Align LLMs to target recommender models (MF, SASRec) across three datasets, with Unigram as a popularity-based baseline.

Benchmarks:

Amazon Beauty (E-commerce Recommendation)
Yelp (Business Recommendation)
Steam (Game Recommendation)

Metrics:

Alignment Accuracy (Acc@1, Acc@5, Acc@10)
NDCG@10 (Ranking alignment)
Explanation Quality (Human Eval, GPT-4 Eval)
Text Generation Metrics (BLEU, ROUGE - for reconstruction)
Statistical methodology: Not explicitly reported in the paper
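
The alignment metrics can be sketched as below. This assumes Acc@1 is the fraction of cases where the LLM's predicted item equals the target model's top-1 item, and uses the standard NDCG formula; how the paper defines Acc@5 and Acc@10 is not stated in this summary, so they are omitted. Function names are illustrative.

```python
import math

def agreement_at_1(llm_predictions, target_top1):
    """Fraction of cases where the LLM picks the target model's top-1 item."""
    hits = sum(p == t for p, t in zip(llm_predictions, target_top1))
    return hits / len(target_top1)

def ndcg_at_k(ranked_items, relevance, k=10):
    """NDCG@k for one ranking; `relevance` maps item id -> gain (missing = 0)."""
    gains = [relevance.get(i, 0.0) for i in ranked_items[:k]]
    dcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

print(agreement_at_1(["a", "b", "c", "d"], ["a", "x", "c", "y"]))
print(ndcg_at_k(["b", "a"], {"a": 2.0, "b": 1.0}))
```

Because the reference ranking is the target model's own output, a high score indicates fidelity to the recommender, not recommendation quality against user ground truth.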

Key Results

Alignment performance on Amazon Beauty dataset. RecExplainer-H (Hybrid) consistently outperforms zero-shot and few-shot baselines in mimicking the target model (SASRec).
Benchmark	Metric	Baseline	This Paper	Δ
Amazon Beauty (Target: SASRec)	Acc@1	0.027	0.298	+0.271
Amazon Beauty (Target: SASRec)	Acc@1	0.185	0.298	+0.113
Yelp (Target: MF)	NDCG@5	0.641	0.932	+0.291
Yelp	GPT-4 Eval Score (1-10)	7.38	8.06	+0.68
Yelp	Human Eval (Win Rate)	50%	69%	+19%

Experiment Figures

Human evaluation results comparing RecExplainer-H explanations against baselines (Real Reviews, GPT-4, etc.).

Main Takeaways

Alignment methods (Behavior, Intention, Hybrid) vastly outperform standard prompting (Zero/Few-shot) in mimicking target model decisions.
Hybrid alignment is generally the most robust, leveraging both semantic text information and precise latent embeddings.
Intention alignment (using only embeddings) can suffer from 'hallucination' or loss of semantic detail if not augmented with text (Task 6 helps mitigate this).
The approach generalizes across different target models (Matrix Factorization, SASRec) and domains (E-commerce, Local Business, Gaming).

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering
Instruction Tuning / Parameter-Efficient Fine-Tuning (PEFT)
Latent Factor Models (Two-Tower architecture)

Key Terms


LLM: Large Language Model

Surrogate Model: An interpretable model trained to approximate the predictions of a complex, black-box model to provide explanations

Behavior Alignment: Training the LLM to mimic the input-output behavior of the target model using text

Intention Alignment: Training the LLM to understand the target model's internal state by projecting recommender embeddings into the LLM's token space

Hybrid Alignment: Combining text-based and embedding-based inputs to train the LLM

Two-Tower Model: A recommender architecture where users and items are encoded independently into embeddings, and their dot product determines the score

ShareGPT: A dataset of user-ChatGPT conversations used for instruction tuning to maintain general LLM capabilities

Unigram: A simple baseline that recommends items based on their global popularity frequency

SASRec: Self-Attentive Sequential Recommendation—a sequence-based recommender model used as a target black-box model in this paper

MF: Matrix Factorization—a classic collaborative filtering method used as a target black-box model

BLEU: Bilingual Evaluation Understudy—a metric for evaluating the quality of text which computes the overlap of n-grams between candidate and reference texts

ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics used to evaluate automatic summarization and machine translation

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that takes into account the position of relevant items

HR: Hit Ratio—the fraction of users for whom the correct item is included in the recommendation list