University of California, San Diego,
Netflix,
Cornell University
arXiv
(2024)
Recommendation, P13N
📝 Paper Summary
Conversational Recommender Systems (CRS), Large Language Models (LLMs) for Recommendation
Reindex-Then-Adapt converts multi-token item titles in LLMs into single-token representations to enable efficient adjustment of recommendation probabilities toward target platform distributions.
Core Problem
LLMs recommend items by autoregressively generating multi-token titles, making it computationally expensive to calculate or adjust the full probability distribution over all items to match target platform popularity.
Why it matters:
LLMs trained on general corpora often recommend items (e.g., 'Black Panther') that do not match the popularity distribution of specific target platforms (e.g., ReDIAL dataset)
Target data distributions evolve rapidly (e.g., monthly popularity shifts), requiring efficient adaptation without full retraining
Current generative retrieval methods prevent easy access to full-item logits needed for standard RecSys control techniques
Concrete Example: In the ReDIAL dataset, 'The Dark Knight' is very popular, but a standard Llama-7b model rarely recommends it. Conversely, 'Black Panther' is over-recommended by the LLM compared to its actual popularity on the platform. The multi-token generation of 'The Dark Knight' makes it hard to simply boost its probability score globally.
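To make the cost argument concrete, here is a minimal sketch of how a multi-token title is scored under autoregressive generation: the title's probability is the product of its per-token conditional probabilities, so raising it means changing several conditionals that may be shared with other titles (for example, a common first token). The probabilities, catalog size, and function names below are invented for illustration and do not come from the paper.

```python
import math

def title_log_prob(token_probs):
    """Log-probability of a whole title: sum of per-token conditional log-probs."""
    return sum(math.log(p) for p in token_probs)

def scores_to_compute(num_items, tokens_per_item):
    """Number of per-token scores needed to rank the full catalog."""
    return num_items * tokens_per_item

# Invented conditional probabilities for the three tokens of a title such as
# 'The Dark Knight' (illustrative numbers, not taken from any model).
token_probs = [0.20, 0.05, 0.90]
p_title = math.exp(title_log_prob(token_probs))

# Hypothetical catalog: 6,000 items averaging 4 tokens per title.
multi_token_cost = scores_to_compute(6000, 4)   # token-by-token title scoring
single_token_cost = scores_to_compute(6000, 1)  # one logit per reindexed item

print(f"P(title) = {p_title:.4f}")
print(multi_token_cost, single_token_cost)
```

With single-token items, one forward pass yields a logit for every item at once, which is what makes a global adjustment practical.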
Key Novelty
Reindex-Then-Adapt (RTA) Framework
Reindex Step: Squeezes multi-token item titles (e.g., 'Edge', 'of', 'Tomorrow') into a single token embedding using a contrastive aggregator, allowing the LLM to represent items as atomic units.
Adapt Step: Once items are single tokens, the model can efficiently compute logits for all items and apply affine transformations (bias adjustments) or mix with traditional RecSys scores to match target distributions.
Architecture
The Reindex-Then-Adapt (RTA) framework pipeline. It shows the transition from original LLM multi-token indexing to reindexed single-token embeddings, followed by the adaptation phase.
Evaluation Highlights
+59.37% relative Top-10 Hit Rate improvement for Llama-2-7b on the ReDIAL dataset using the RTA framework (Hit@10 rises from 0.1435 to 0.2287)
Surpasses all open-source baselines on ReDIAL, Reddit-Movie, and GoRecDial datasets
Achieves better alignment with target item popularity distributions compared to vanilla LLMs
Breakthrough Assessment
8/10
Narrows the 'generative vs. discriminative' gap in LLM recommendation by making item-level logits accessible for control, with large reported gains (e.g., ReDIAL Hit@10 from 0.1435 to 0.2287).
⚙️ Technical Details
Problem Definition
Setting: Conversational Recommendation where a system suggests a ranked list of items based on dialogue context
Inputs: Dialogue context comprising user utterances and system responses
Outputs: Ranked list of item indices (titles) recommended to the user
Pipeline Flow
Reindexing: Train aggregator to map multi-token titles to single embeddings
Inference: LLM generates context embedding
Adaptation: Adjust logits via bias/gating
Ranking: Output top-k items
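The four pipeline stages can be sketched end to end on toy data. This is only an illustration under stated assumptions: mean pooling stands in for the paper's trained aggregator, the 2-d embeddings and the bias value are invented, and the names `mean_pool` and `recommend` are hypothetical.

```python
def mean_pool(token_vectors):
    """Stand-in aggregator (assumption): average token embeddings into one vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recommend(context_vec, item_token_vecs, bias, k):
    """Reindex items, score all of them at once, add per-item bias, return top-k."""
    item_vecs = {title: mean_pool(vecs) for title, vecs in item_token_vecs.items()}
    logits = {
        title: dot(context_vec, vec) + bias.get(title, 0.0)
        for title, vec in item_vecs.items()
    }
    return sorted(logits, key=logits.get, reverse=True)[:k]

# Toy token embeddings, invented for illustration only.
items = {
    "The Dark Knight": [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]],
    "Black Panther": [[0.9, 0.3], [1.0, 0.2]],
    "Edge of Tomorrow": [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]],
}
context = [1.0, 0.1]  # stand-in for the LLM's context embedding

no_bias = recommend(context, items, {}, k=2)
with_bias = recommend(context, items, {"The Dark Knight": 0.5}, k=2)
print(no_bias)
print(with_bias)
```

In this toy setup a positive bias on one item is enough to move it ahead of an item the unadjusted scores preferred, which is the kind of control the Adaptation stage is meant to provide.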
System Modules
Aggregator
Compress multi-token item titles into single-token representations
Model or implementation: Trainable aggregator (e.g., MLP or RNN-based)
LLM Backbone
Process conversation context and generate query embedding
Model or implementation: Llama-2-7b
Adaptation Layer
Adjust prediction logits to match target distribution
Model or implementation: Affine transformation (diagonal W and bias b) or Gating mechanism
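Written out, the affine adaptation with a diagonal W might look as follows. The symbols z (item logits), w (diagonal entries of W), N (number of items), and c (dialogue context) are notation introduced here for illustration; only W and b are named in the summary.

```latex
\tilde{z} = W z + b, \qquad W = \operatorname{diag}(w_1, \dots, w_N)
\quad\Longrightarrow\quad \tilde{z}_i = w_i z_i + b_i ,

p(i \mid c) = \frac{\exp(\tilde{z}_i)}{\sum_{j=1}^{N} \exp(\tilde{z}_j)} .
```

Because W is diagonal, the layer has only 2N parameters, and fixing every w_i = 1 reduces it to a pure per-item bias adjustment.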
Novel Architectural Elements
Separation of 'Reindex' and 'Adapt' phases to enable full-vocabulary logit access in generative models
Use of contrastive learning to squeeze pre-trained multi-token semantics into new single-token slots without losing L2I knowledge
Modeling
Base Model: Llama-2-7b
Training Method: Two-stage training: (1) Reindexing via Contrastive Loss, (2) Adaptation via Maximum Likelihood Estimation
Objective Functions:
Purpose: Train aggregator to represent multi-token items as single tokens.
Formally: Contrastive loss L_reindex minimizing distance between context embedding q and aggregated item embedding v_tilde.
Purpose: Learn adaptation parameters (bias terms) to match target data.
Formally: Maximum Likelihood Estimation L_adapt maximizing probability of ground-truth items in target dataset.
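The summary does not give the exact form of either loss. One common way to instantiate them, shown here purely as an assumption, is an InfoNCE-style contrastive loss for reindexing and a negative log-likelihood for adaptation. The similarity function sim, temperature τ, candidate set 𝒱, positive item v⁺, and target dataset 𝒟 are notation introduced for this sketch.

```latex
\mathcal{L}_{\text{reindex}}
  = -\log \frac{\exp\!\big(\operatorname{sim}(q, \tilde{v}^{+}) / \tau\big)}
               {\sum_{\tilde{v}_j \in \mathcal{V}} \exp\!\big(\operatorname{sim}(q, \tilde{v}_j) / \tau\big)},
\qquad
\mathcal{L}_{\text{adapt}}
  = -\sum_{(c,\, i) \in \mathcal{D}} \log p(i \mid c).
```

Minimizing the first term pulls q toward the aggregated embedding of the matching item and away from the other candidates; minimizing the second is equivalent to maximizing the likelihood of the ground-truth items in the target data.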
Training Data:
Reindex step uses mixture of L2I (content-target) and L2R (query-target) samples
Adapt step uses target platform data (e.g., ReDIAL)
Compute: Not reported in the paper
Comparison to Prior Work
vs. UniCRS: RTA focuses on aligning output distributions rather than just optimizing generation quality
vs. BART-based methods: RTA uses a decoder-only LLM (Llama-2) and specifically addresses the multi-token indexing issue
vs. Pop-Bias Mitigation [not cited in paper]: RTA actively adapts to popularity rather than just debiasing, allowing control over the distribution.
Limitations
Requires re-indexing step which adds computational overhead before adaptation
Adaptation effectiveness depends on the quality of the initial L2I knowledge in the LLM
Evaluation limited to movie domain datasets (ReDIAL, Reddit-Movie)
Does not explicitly address cold-start items not present in LLM pre-training
Reproducibility
No code URL provided in the paper. Datasets (ReDIAL, Reddit-Movie) are public. Base model (Llama-2) is public.
📊 Experiments & Results
Evaluation Setup
Conversational recommendation on movie datasets
Benchmarks:
ReDIAL (Conversational Recommendation)
GoRecDial (Conversational Recommendation)
Reddit-Movie (Conversational Recommendation)
Metrics:
Hit@K (Hit Rate)
NDCG@K (Normalized Discounted Cumulative Gain)
MRR@K (Mean Reciprocal Rank)
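For reference, the three metrics can be computed per example as below and then averaged over the test set. This sketch assumes a single ground-truth item per example, so the ideal DCG is 1; the function names are illustrative and not the paper's evaluation code.

```python
import math

def hit_at_k(ranked, target, k):
    """1.0 if the target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def mrr_at_k(ranked, target, k):
    """Reciprocal rank of the target within the top-k, or 0.0 if absent."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, target, k):
    """NDCG with one relevant item: 1 / log2(rank + 1), or 0.0 if absent."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = ["Black Panther", "The Dark Knight", "Edge of Tomorrow"]
target = "The Dark Knight"
print(hit_at_k(ranked, target, 3), mrr_at_k(ranked, target, 3), ndcg_at_k(ranked, target, 3))
```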
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
ReDIAL | Hit@10 | 0.1435 | 0.2287 | +0.0852
ReDIAL | NDCG@10 | 0.0812 | 0.1345 | +0.0533
GoRecDial | Hit@10 | 0.2612 | 0.3814 | +0.1202
Reddit-Movie | Hit@10 | 0.1105 | 0.1782 | +0.0677
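The Δ values in the results are absolute differences. The relative gains can be recomputed from the reported baseline and RTA numbers, as in this small sketch (the helper name `relative_gain` is illustrative):

```python
def relative_gain(baseline, ours):
    """Relative improvement over the baseline, in percent."""
    return (ours - baseline) / baseline * 100.0

# (benchmark, metric, baseline, this paper), copied from the reported results.
rows = [
    ("ReDIAL", "Hit@10", 0.1435, 0.2287),
    ("ReDIAL", "NDCG@10", 0.0812, 0.1345),
    ("GoRecDial", "Hit@10", 0.2612, 0.3814),
    ("Reddit-Movie", "Hit@10", 0.1105, 0.1782),
]

gains = {(name, metric): relative_gain(base, ours) for name, metric, base, ours in rows}
for (name, metric), gain in gains.items():
    print(f"{name} {metric}: +{gain:.2f}%")
```

The ReDIAL Hit@10 row works out to roughly +59.37%, matching the headline figure in the Evaluation Highlights.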
Experiment Figures
Comparison of item popularity distributions between ReDIAL dataset, Llama-7b, and RTA-Llama-7b.
Recall@1 performance of various LLMs on the Learn to Index (L2I) task, stratified by item occurrence frequency.
Main Takeaways
RTA substantially improves recommendation accuracy across all three datasets compared to the original LLM (no statistical significance testing is reported)
Adaptation via bias terms effectively aligns LLM output with target popularity distributions
Combining LLMs with traditional RecSys (via gating) yields further improvements, leveraging the complementary strengths of semantic understanding and collaborative filtering
The 'Reindex' step successfully preserves item semantics while enabling efficient 'Adapt' step operations
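The summary does not specify how the gate is parameterized. One simple possibility, shown here only as an assumption, is a single learned scalar passed through a sigmoid that interpolates between LLM logits and traditional RecSys scores over the same item list. The names `gated_scores` and `gate_param` and all numbers are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_scores(llm_logits, recsys_scores, gate_param):
    """Mix two score lists: g * LLM + (1 - g) * RecSys, with g = sigmoid(gate_param)."""
    g = sigmoid(gate_param)
    return [g * l + (1.0 - g) * r for l, r in zip(llm_logits, recsys_scores)]

# Invented scores for two items, for illustration only.
llm_logits = [2.0, 0.0]
recsys_scores = [0.0, 4.0]

balanced = gated_scores(llm_logits, recsys_scores, gate_param=0.0)   # g = 0.5
llm_heavy = gated_scores(llm_logits, recsys_scores, gate_param=20.0)  # g close to 1
print(balanced, llm_heavy)
```

A mix like this is only possible because reindexing exposes one logit per item; with multi-token titles there is no single LLM score to blend with the RecSys score.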
📚 Prerequisite Knowledge
Prerequisites
Transformer-based Language Models
Differentiable Search Index (DSI)
Contrastive Learning
Key Terms
DSI: Differentiable Search Index—a paradigm where a model maps queries directly to item identifiers (docids) stored within its parameters
L2I: Learn to Index—a DSI task where the model learns to map item content (e.g., descriptions) to item identifiers
L2R: Learn to Retrieve—a DSI task where the model maps queries to item identifiers
Logit: The raw, unnormalized prediction score output by a neural network before applying a function like Softmax
Affine transformation: A linear map plus an offset (Wx + b), used here to adjust the logits of the LLM to match target distributions
RecSys: Recommender Systems—traditional algorithms (like matrix factorization or two-tower models) focused on collaborative filtering
Contrastive loss: A loss function that pulls representations of similar pairs close together while pushing dissimilar pairs apart
CRS: Conversational Recommender Systems—systems that recommend items through interactive dialogue