Reindex-Then-Adapt: Improving Large Language Models for Conversational Recommendation

📝 Paper Summary

Conversational Recommender Systems (CRS)
Large Language Models (LLMs) for Recommendation

Reindex-Then-Adapt converts multi-token item titles in LLMs into single-token representations to enable efficient adjustment of recommendation probabilities toward target platform distributions.

Core Problem

LLMs recommend items by autoregressively generating multi-token titles, making it computationally expensive to calculate or adjust the full probability distribution over all items to match target platform popularity.

Why it matters:

LLMs trained on general corpora often recommend items (e.g., 'Black Panther') that do not match the popularity distribution of specific target platforms (e.g., ReDIAL dataset)
Target data distributions evolve rapidly (e.g., monthly popularity shifts), requiring efficient adaptation without full retraining
Current generative retrieval methods prevent easy access to full-item logits needed for standard RecSys control techniques

Concrete Example: In the ReDIAL dataset, 'The Dark Knight' is very popular, but a standard Llama-2-7b model rarely recommends it. Conversely, the LLM recommends 'Black Panther' more often than its actual popularity on the platform warrants. Because 'The Dark Knight' is generated as a sequence of several tokens, its overall probability cannot be boosted with a single global adjustment.

Key Novelty

Reindex-Then-Adapt (RTA) Framework

Reindex Step: Compresses a multi-token item title (e.g., 'Edge', 'of', 'Tomorrow') into a single-token embedding using a contrastively trained aggregator, allowing the LLM to represent each item as an atomic unit.
Adapt Step: Once items are single tokens, the model can efficiently compute logits for all items and apply affine transformations (bias adjustments) or mix with traditional RecSys scores to match target distributions.
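
The effect of the Reindex step can be illustrated with a minimal sketch. It assumes mean pooling as a stand-in for the trained contrastive aggregator (the paper's aggregator is learned, e.g., MLP or RNN-based), and the 2-dimensional embeddings and function names are hypothetical. The point is only that once each title is one vector, scoring every item is a single pass of dot products against the context embedding.

```python
from typing import Dict, List

def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Collapse a title's token embeddings into one vector.

    Assumption: mean pooling stands in for the trained aggregator.
    """
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def all_item_logits(context: List[float],
                    item_embeddings: Dict[str, List[float]]) -> Dict[str, float]:
    """One logit per item: dot product of context and single-token item embedding."""
    return {title: dot(context, emb) for title, emb in item_embeddings.items()}

# Hypothetical token embeddings for two multi-token titles.
title_tokens = {
    "Edge of Tomorrow": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "The Dark Knight": [[0.0, 2.0], [0.0, 1.0], [0.0, 0.0]],
}

# Reindex: every title becomes exactly one embedding.
item_embeddings = {title: mean_pool(toks) for title, toks in title_tokens.items()}

# Full-item logits for a hypothetical context embedding.
context = [0.0, 1.0]
logits = all_item_logits(context, item_embeddings)
```

With multi-token titles, the same scores would require one autoregressive pass per candidate title, which is what makes full-distribution adjustment expensive.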

Architecture

The Reindex-Then-Adapt (RTA) framework pipeline. It shows the transition from original LLM multi-token indexing to reindexed single-token embeddings, followed by the adaptation phase.

Evaluation Highlights

+59.37% relative improvement in Hit@10 (0.1435 to 0.2287) for Llama-2-7b on the ReDIAL dataset using the RTA framework
Surpasses all open-source baselines on ReDIAL, Reddit-Movie, and GoRecDial datasets
Achieves better alignment with target item popularity distributions compared to vanilla LLMs

Breakthrough Assessment

8/10

Addresses the 'generative vs. discriminative' gap in LLM recommendation by making generative logits accessible for control, with large empirical gains (e.g., a 59.37% relative Hit@10 improvement on ReDIAL).

⚙️ Technical Details

Problem Definition

Setting: Conversational Recommendation where a system suggests a ranked list of items based on dialogue context

Inputs: Dialogue context comprising user utterances and system responses

Outputs: Ranked list of item indices (titles) recommended to the user

Pipeline Flow

Reindexing: Train aggregator to map multi-token titles to single embeddings
Inference: LLM generates context embedding
Adaptation: Adjust logits via bias/gating
Ranking: Output top-k items
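
The final ranking stage is simple once adapted logits exist for every item. The sketch below assumes those logits are already available as a dictionary; `rank_top_k` and the example scores are hypothetical.

```python
import heapq
from typing import Dict, List

def rank_top_k(adapted_logits: Dict[str, float], k: int) -> List[str]:
    """Return the k item titles with the highest adapted logits, best first."""
    return heapq.nlargest(k, adapted_logits, key=adapted_logits.get)

# Hypothetical adapted logits for three items.
example_logits = {"Black Panther": 0.1, "The Dark Knight": 0.9, "Inception": 0.5}
top_two = rank_top_k(example_logits, k=2)
```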

System Modules

Aggregator

Compress multi-token item titles into single-token representations

Model or implementation: Trainable aggregator (e.g., MLP or RNN-based)

LLM Backbone

Process conversation context and generate query embedding

Model or implementation: Llama-2-7b

Adaptation Layer

Adjust prediction logits to match target distribution

Model or implementation: Affine transformation (diagonal W and bias b) or Gating mechanism
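
A minimal sketch of the two adaptation options, under stated assumptions: the diagonal W is represented as one scale per item, and the gate is a single scalar in [0, 1] that mixes LLM logits with scores from a traditional recommender. The paper may parameterize the gate differently; all names and numbers here are hypothetical.

```python
import math
from typing import List

def affine_adapt(logits: List[float], scale: List[float], bias: List[float]) -> List[float]:
    """Diagonal affine map: each item's logit gets its own scale w_j and bias b_j."""
    return [w * z + b for z, w, b in zip(logits, scale, bias)]

def gated_mix(llm_logits: List[float], recsys_scores: List[float], gate: float) -> List[float]:
    """Assumed scalar gate: convex combination of LLM logits and RecSys scores."""
    return [gate * z + (1.0 - gate) * s for z, s in zip(llm_logits, recsys_scores)]

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the full item set."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: the LLM over-scores 'Black Panther' relative to the platform.
items = ["Black Panther", "The Dark Knight"]
llm_logits = [2.0, 0.5]

# Bias-only adaptation (scale fixed to 1): lower one item, raise the other.
adapted = affine_adapt(llm_logits, scale=[1.0, 1.0], bias=[-1.0, 1.0])
probs = softmax(adapted)

# Alternative: blend with hypothetical collaborative-filtering scores.
mixed = gated_mix(llm_logits, recsys_scores=[0.0, 1.0], gate=0.5)
```

Because the bias is a per-item vector, refreshing it for a new popularity snapshot does not require touching the LLM weights, which is the efficiency argument made for the Adapt step.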

Novel Architectural Elements

Separation of 'Reindex' and 'Adapt' phases to enable full-vocabulary logit access in generative models
Use of contrastive learning to squeeze pre-trained multi-token semantics into new single-token slots without losing L2I knowledge

Modeling

Base Model: Llama-2-7b

Training Method: Two-stage training: (1) Reindexing via Contrastive Loss, (2) Adaptation via Maximum Likelihood Estimation

Objective Functions:

Purpose: Train aggregator to represent multi-token items as single tokens.

Formally: Contrastive loss L_reindex minimizing distance between context embedding q and aggregated item embedding v_tilde.

Purpose: Learn adaptation parameters (bias terms) to match target data.

Formally: Maximum Likelihood Estimation L_adapt maximizing probability of ground-truth items in target dataset.
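
The summary does not give the exact form of either loss. One plausible instantiation, assumed here for illustration only, is an InfoNCE-style contrastive loss for reindexing and a softmax negative log-likelihood for adaptation. In this sketch q is the context embedding, v_tilde_j the aggregated embedding of item j, i+ the ground-truth item, N a set of negative items, tau a temperature (an assumption), and w_j, b_j the diagonal affine parameters of the adaptation layer.

```latex
% Assumed InfoNCE-style form; not taken from the paper.
\mathcal{L}_{\text{reindex}}
  = -\log \frac{\exp\!\left(\mathbf{q}^{\top}\tilde{\mathbf{v}}_{i^{+}} / \tau\right)}
               {\sum_{j \in \mathcal{N} \cup \{i^{+}\}} \exp\!\left(\mathbf{q}^{\top}\tilde{\mathbf{v}}_{j} / \tau\right)}

% Maximum likelihood over the full item set I, with z_j the reindexed logit of item j.
\mathcal{L}_{\text{adapt}}
  = -\log \frac{\exp\!\left(w_{i^{+}} z_{i^{+}} + b_{i^{+}}\right)}
               {\sum_{j=1}^{|\mathcal{I}|} \exp\!\left(w_{j} z_{j} + b_{j}\right)},
  \qquad z_{j} = \mathbf{q}^{\top}\tilde{\mathbf{v}}_{j}
```

Minimizing the second expression is equivalent to maximizing the probability of the ground-truth item, and its denominator is tractable only because every item has a single logit after reindexing.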

Training Data:

Reindex step uses mixture of L2I (content-target) and L2R (query-target) samples
Adapt step uses target platform data (e.g., ReDIAL)

Compute: Not reported in the paper

Comparison to Prior Work

vs. UniCRS: RTA focuses on aligning output distributions rather than just optimizing generation quality
vs. BART-based methods: RTA uses a decoder-only LLM (Llama-2) and specifically addresses the multi-token indexing issue
vs. Pop-Bias Mitigation [not cited in paper]: RTA actively adapts to popularity rather than just debiasing, allowing control over the distribution.

Limitations

Requires re-indexing step which adds computational overhead before adaptation
Adaptation effectiveness depends on the quality of the initial L2I knowledge in the LLM
Evaluation limited to movie-domain datasets (ReDIAL, GoRecDial, Reddit-Movie)
Does not explicitly address cold-start items not present in LLM pre-training

Reproducibility

No code URL provided in the paper. Datasets (ReDIAL, Reddit-Movie) are public. Base model (Llama-2) is public.

📊 Experiments & Results

Evaluation Setup

Conversational recommendation on movie datasets

Benchmarks:

ReDIAL (Conversational Recommendation)
GoRecDial (Conversational Recommendation)
Reddit-Movie (Conversational Recommendation)

Metrics:

Hit@K (Hit Rate)
NDCG@K (Normalized Discounted Cumulative Gain)
MRR@K (Mean Reciprocal Rank)

Statistical methodology: Not explicitly reported in the paper
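
For reference, the three ranking metrics listed above can be sketched as follows. This assumes a single ground-truth item per recommendation turn, in which case the ideal DCG is 1 and NDCG reduces to 1 / log2(rank + 1). The function names and example ranking are hypothetical, and reported scores would be averages of these per-turn values over the test set.

```python
import math
from typing import List

def hit_at_k(ranked: List[str], target: str, k: int) -> float:
    """1.0 if the target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked: List[str], target: str, k: int) -> float:
    """NDCG with one relevant item: 1 / log2(rank + 1) if it is in the top k."""
    if target in ranked[:k]:
        rank = ranked.index(target) + 1  # 1-based position
        return 1.0 / math.log2(rank + 1)
    return 0.0

def mrr_at_k(ranked: List[str], target: str, k: int) -> float:
    """Reciprocal rank of the target if it is in the top k, else 0.0."""
    if target in ranked[:k]:
        return 1.0 / (ranked.index(target) + 1)
    return 0.0

# Hypothetical ranked list for one dialogue turn.
ranked = ["Black Panther", "The Dark Knight", "Inception"]
target = "The Dark Knight"

hit = hit_at_k(ranked, target, k=10)
ndcg = ndcg_at_k(ranked, target, k=10)
mrr = mrr_at_k(ranked, target, k=10)
```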

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ReDIAL	Hit@10	0.1435	0.2287	+0.0852
ReDIAL	NDCG@10	0.0812	0.1345	+0.0533
GoRecDial	Hit@10	0.2612	0.3814	+0.1202
Reddit-Movie	Hit@10	0.1105	0.1782	+0.0677
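
The Δ column above is an absolute difference, while the headline +59.37% figure is a relative gain. A short check of that arithmetic, using only the values in the table (the helper name is hypothetical):

```python
def relative_gain_pct(baseline: float, ours: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (ours - baseline) / baseline * 100.0

# ReDIAL Hit@10 values from the table.
baseline_hit10, rta_hit10 = 0.1435, 0.2287

absolute_delta = rta_hit10 - baseline_hit10                           # matches the table's +0.0852
relative_delta = relative_gain_pct(baseline_hit10, rta_hit10)         # about 59.37 percent
```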

Experiment Figures

Comparison of item popularity distributions between ReDIAL dataset, Llama-7b, and RTA-Llama-7b.

Recall@1 performance of various LLMs on the Learn to Index (L2I) task, stratified by item occurrence frequency.

Main Takeaways

RTA improves Hit@10 on all three datasets compared to the original LLM, with absolute gains ranging from 0.0677 (Reddit-Movie) to 0.1202 (GoRecDial)
Adaptation via bias terms effectively aligns LLM output with target popularity distributions
Combining LLMs with traditional RecSys (via gating) yields further improvements, leveraging the complementary strengths of semantic understanding and collaborative filtering
The 'Reindex' step successfully preserves item semantics while enabling efficient 'Adapt' step operations

📚 Prerequisite Knowledge

Prerequisites

Transformer-based Language Models
Differentiable Search Index (DSI)
Contrastive Learning

Key Terms

DSI: Differentiable Search Index—a paradigm where a model maps queries directly to item identifiers (docids) stored within its parameters

L2I: Learn to Index—a DSI task where the model learns to map item content (e.g., descriptions) to item identifiers

L2R: Learn to Retrieve—a DSI task where the model maps queries to item identifiers

Logit: The raw, unnormalized prediction score output by a neural network before applying a function like Softmax

Affine transformation: A linear map plus a bias term (Wx + b), used here to adjust the logits of the LLM to match target distributions

RecSys: Recommender Systems—traditional algorithms (like matrix factorization or two-tower models) focused on collaborative filtering

Contrastive loss: A loss function that pulls representations of similar pairs close together while pushing dissimilar pairs apart

CRS: Conversational Recommender Systems—systems that recommend items through interactive dialogue