Preference Discerning with LLM-Enhanced Generative Retrieval

📝 Paper Summary

Sequential Recommendation, Generative Retrieval, LLM-based Recommendation

The paper introduces 'preference discerning,' a paradigm where generative recommendation models are explicitly conditioned on natural language user preferences to dynamically steer recommendations without retraining.

Core Problem

Current sequential recommendation models rely solely on past interaction history, making them unable to dynamically adapt to changing user interests (e.g., new hobbies) without retraining.

Why it matters:

Static models reinforce echo chambers by continuing to recommend similar content even when user intent shifts
Adapting to life changes (e.g., a career transition) currently requires slow model retraining instead of an immediate response
Existing methods lack a mechanism for users to explicitly steer recommendations using natural language instructions

Concrete Example: A user who historically watched entertainment videos starts learning a new skill (e.g., coding). Current models continue recommending viral entertainment videos instead of tutorials because they only see the long history of entertainment, ignoring the recent shift in intent.

Key Novelty

Preference Discerning (Mender)

Decouples preference extraction from recommendation: uses an LLM to distill history into concise text preferences, then conditions the recommender on these text preferences
Fuses semantic item IDs (collaborative filtering) with pre-trained language encoders (semantic understanding) to allow direct steering via natural language
Introduces a 5-axis benchmark to evaluate steerability, including sentiment following and fine-grained control

Architecture

The architecture of Mender (Multimodal Preference Discerner). It shows the two-stream input: interaction history and natural language preferences.

Evaluation Highlights

Mender outperforms TIGER (state-of-the-art generative retrieval) by ~20-30 percentage points of NDCG@10 (absolute, not relative) on preference-based recommendation across the Amazon Sports and Beauty datasets
Achieves superior sentiment following: effectively avoids items associated with negative preferences while retrieving positive ones, unlike standard sequential baselines
Demonstrates zero-shot steerability: can be guided by preferences not seen during training to recommend semantically related items

Breakthrough Assessment

7/10

Strong conceptual contribution in making recommenders steerable via language. The benchmark is comprehensive. Performance gains are significant, though the reliance on generated preferences adds pipeline complexity.

⚙️ Technical Details

Problem Definition

Setting: Sequential recommendation where the next item prediction is conditioned on both interaction history and explicit natural language user preferences

Inputs: Sequence of past items s_u and a set of generated user preferences P_u in natural language

Outputs: Predicted next item i_t

Pipeline Flow

Preference Approximation: History + Item Info → LLM → Natural Language Preferences
Preference Conditioning: Preferences + History → Multimodal Encoder → Decoder → Semantic IDs
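
The two stages above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `approximate_preferences` is a hypothetical stand-in for the LLM call (it only rephrases reviews so the sketch runs offline), and `build_recommender_input` is a hypothetical helper showing one way preferences and history could be packed into a single conditioning text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    title: str
    review: str


def approximate_preferences(history: List[Item]) -> List[str]:
    """Stage 1 stand-in. A real system would prompt an LLM with history and item info.

    This stub just turns each non-empty review into a preference-like sentence.
    """
    preferences = []
    for item in history:
        review = item.review.strip().rstrip(".")
        if review:
            preferences.append(f"Prefers: {review}")
    return preferences


def build_recommender_input(preferences: List[str], history: List[Item]) -> str:
    """Stage 2 input: join preferences and history into one conditioning text."""
    pref_text = " ".join(f"{p}." for p in preferences)
    hist_text = "; ".join(item.title for item in history)
    return f"Preferences: {pref_text} History: {hist_text}"


history = [
    Item("Trail running shoes", "lightweight shoes with good grip."),
    Item("Hydration vest", "gear that stays put on long runs"),
]
prefs = approximate_preferences(history)
model_input = build_recommender_input(prefs, history)
print(model_input)
```

In the system described here, the resulting text would be fed to the encoder and the decoder would emit Semantic IDs; both of those steps are left out of this sketch.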

System Modules

Preference Generator

Distill user interaction history and reviews into concise natural language preference statements

Model or implementation: LLM (Specific model not detailed in summary, likely Llama or similar)

Item Quantizer

Convert item descriptions into discrete Semantic IDs for generative retrieval

Model or implementation: RQ-VAE trained on Sentence-T5 embeddings
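
To make the quantization step concrete, here is a minimal sketch of residual quantization, the core idea behind RQ-VAE codes. It assumes tiny hand-picked codebooks and 2-dimensional vectors purely for illustration; an actual RQ-VAE learns its encoder, decoder, and codebooks jointly, and operates on Sentence-T5 embeddings.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(entry, x)) for entry in codebook]
    return dists.index(min(dists))


def residual_quantize(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[int, ...]:
    """Quantize x level by level; each level encodes what earlier levels missed."""
    residual = list(x)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next level sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)


# Toy codebooks (assumed values): depth 3, two entries per level.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.5, -0.5)],
    [(0.0, 0.0), (0.1, 0.1)],
]
semantic_id = residual_quantize((1.45, 0.55), codebooks)
print(semantic_id)  # (1, 1, 0)
```

Items with similar embeddings tend to share leading codes, which is what lets the tuple act as a coarse-to-fine identifier.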

Mender Encoder

Encode natural language preferences and interaction history into a joint representation

Model or implementation: FLAN-T5-Small Encoder

Mender Decoder

Autoregressively generate the Semantic IDs of the recommended item

Model or implementation: Transformer Decoder (randomly initialized)
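
The autoregressive generation of Semantic IDs can be illustrated with a toy beam search. Note the assumptions: this summary does not state which decoding strategy is used (beam search is simply a common choice for generative retrieval), `toy_decoder_step` is a hypothetical fixed distribution standing in for a trained decoder, and the catalog entries are made up.

```python
import math
from typing import Callable, Dict, List, Tuple

Code = Tuple[int, ...]


def beam_search(
    next_log_probs: Callable[[Code], Dict[int, float]],
    depth: int,
    beam_width: int,
) -> List[Tuple[Code, float]]:
    """Expand code prefixes one token at a time, keeping the best beam_width."""
    beams: List[Tuple[Code, float]] = [((), 0.0)]
    for _ in range(depth):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_decoder_step(prefix: Code) -> Dict[int, float]:
    """Hypothetical stand-in: a trained decoder would condition on history and preferences."""
    probs = {0: 0.7, 1: 0.3} if len(prefix) % 2 == 0 else {0: 0.4, 1: 0.6}
    return {token: math.log(p) for token, p in probs.items()}


# Made-up catalog mapping Semantic ID tuples to items.
catalog = {(0, 1, 0): "yoga mat", (0, 0, 0): "resistance bands", (1, 1, 1): "jump rope"}

beams = beam_search(toy_decoder_step, depth=3, beam_width=2)
# Generated codes that match no catalog item are dropped.
recommendations = [catalog[code] for code, _ in beams if code in catalog]
print(recommendations)  # ['yoga mat', 'resistance bands']
```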

Novel Architectural Elements

Multimodal fusion in generative retrieval: combining pre-trained text encoders (FLAN-T5) with learned semantic ID decoders
Explicit conditioning mechanism: inserting generated natural language preferences directly into the recommender's context window to steer output

Modeling

Base Model: Mender (based on TIGER backbone with FLAN-T5 encoder)

Training Method: Supervised training on next-item prediction with preference conditioning

Objective Functions:

Purpose: Minimize the negative log-likelihood of the correct semantic ID sequence.

Formally: Standard autoregressive language modeling loss on the discrete codes.
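
One way to write this objective is shown below. The notation is introduced here for illustration and is not taken from the paper: c_1, ..., c_L are the Semantic ID tokens of the target item (L = 3 for the reported RQ-VAE depth), θ denotes the model parameters, and s_u and P_u are the history and preferences from the Problem Definition.

```latex
\mathcal{L}(\theta) = -\sum_{l=1}^{L} \log p_\theta\left(c_l \mid c_{<l},\, s_u,\, P_u\right)
```

Each term is the log-probability the decoder assigns to the correct code at level l, given the codes already generated and the encoded history and preferences.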

Training Data:

Amazon Sports and Beauty datasets
Augmented with generated preferences via LLM
Split along the benchmark axes: Preference-based Recommendation, Sentiment Following, Fine-Grained Steering, Coarse-Grained Steering, History Consolidation

Key Hyperparameters:

RQ-VAE codebook size: Not explicitly reported in the paper
RQ-VAE depth: 3 (tuple size)
Encoder model: FLAN-T5-Small
Decoder initialization: Random

Compute: Not reported in the paper

Comparison to Prior Work

vs. TIGER: Mender conditions on natural language preferences, whereas TIGER relies only on ID history
vs. SASRec: Mender uses generative retrieval and multimodal input, SASRec uses discriminative ranking on IDs
vs. P5: P5 formulates recommendation as pure text-to-text; Mender uses Semantic IDs for efficiency and precision [not cited in paper but relevant comparison]

Limitations

Relies on the quality of preferences generated by the LLM; poor approximation leads to poor steering
Increases inference complexity due to the need for LLM generation of preferences before recommendation
Mender_Tok variant is computationally heavier than ID-based baselines due to longer text sequences

Reproducibility

Code: https://github.com/facebookresearch/preference_discerning

The paper uses public Amazon datasets (Sports, Beauty). Preference generation relies on an LLM; the specific prompts are provided in Appendix C.

📊 Experiments & Results

Evaluation Setup

Sequential recommendation on Amazon Sports and Beauty datasets, augmented with generated preferences

Benchmarks:

Amazon Sports (Sequential Recommendation)
Amazon Beauty (Sequential Recommendation)

Metrics:

Hit Rate @ 5 (HR@5)
Hit Rate @ 10 (HR@10)
NDCG @ 5
NDCG @ 10
Combined Hit Rate (for sentiment following)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Preference-based Recommendation: Mender variants significantly outperform baselines when explicit preferences are provided.
Amazon Sports	NDCG@10	31.2	60.5	+29.3
Amazon Beauty	NDCG@10	36.8	59.3	+22.5
Steering Capabilities: Mender shows superior ability to be steered towards similar or distinct items.
Amazon Sports	HR@10	5.4	25.1	+19.7
Amazon Sports	HR@10	0.4	11.2	+10.8
Sentiment Following: Mender can distinguish between positive and negative preferences.
Amazon Sports	Combined Hit Rate	0.1	12.3	+12.2
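
The Δ column above is an absolute difference in metric points. As a quick check of the arithmetic, and to show how different the relative gains look, the rows can be recomputed as follows (values copied from the table; the `gains` helper is introduced here only for illustration):

```python
rows = [
    ("Amazon Sports", "NDCG@10", 31.2, 60.5),
    ("Amazon Beauty", "NDCG@10", 36.8, 59.3),
    ("Amazon Sports", "HR@10", 5.4, 25.1),
    ("Amazon Sports", "HR@10", 0.4, 11.2),
    ("Amazon Sports", "Combined Hit Rate", 0.1, 12.3),
]


def gains(baseline: float, ours: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = round(ours - baseline, 1)
    relative_pct = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative_pct


for dataset, metric, baseline, ours in rows:
    absolute, relative_pct = gains(baseline, ours)
    print(f"{dataset:14s} {metric:18s} +{absolute} points ({relative_pct}% relative)")
```

Relative gains become very large when the baseline is close to zero (the 0.4 and 0.1 rows), so the absolute column is the more stable comparison there.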

Experiment Figures

Performance (Hit Rate) on Fine-Grained vs. Coarse-Grained steering tasks.

Sentiment following results, comparing positive vs. negative preference conditioning.

Main Takeaways

Explicitly conditioning on natural language preferences (Preference Discerning) drastically improves recommendation accuracy when user intent is known.
Standard generative retrieval models (TIGER) completely lack the ability to be steered by text preferences, reinforcing the need for the Mender architecture.
Mender demonstrates strong 'sentiment following,' successfully avoiding items when given negative preferences, a capability absent in traditional ID-based models.
The method is effective for both fine-grained (similar items) and coarse-grained (distinct items) steering, enabling dynamic adaptation without retraining.

📚 Prerequisite Knowledge

Prerequisites

Generative Retrieval (representing items as token sequences)
Semantic IDs (RQ-VAE)
Transformer-based sequence modeling
Large Language Models for text generation

Key Terms

Generative Retrieval: A recommendation paradigm that uses autoregressive models to generate item identifiers directly, rather than ranking existing embeddings

Semantic IDs: Discrete tokens representing items, derived from hierarchical quantization (RQ-VAE) of item embeddings, preserving semantic similarity

RQ-VAE: Residual Quantized Variational AutoEncoder—a method to compress high-dimensional vectors into discrete codebook indices (Semantic IDs)

TIGER: Transformer Index for Generative Recommenders—a state-of-the-art generative retrieval model used as the backbone for Mender

Preference Discerning: The proposed paradigm of explicitly conditioning recommendation models on user preferences expressed in natural language

Steerability: The ability to guide a recommendation model towards or away from specific items using external signals (here, text preferences)

Sentence-T5: A pre-trained encoder model used to generate embeddings for sentences, used here to create semantic IDs for items

Hit Rate (HR@k): A metric measuring the proportion of test cases where the ground truth item is present in the top-k predicted items

NDCG: Normalized Discounted Cumulative Gain—a ranking metric that accounts for the position of relevant items in the recommendation list
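
For reference, both metrics are easy to compute in the common sequential-recommendation setting with a single ground-truth item per test case. The sketch below assumes that setting (so the ideal DCG is 1) and uses made-up item names.

```python
import math
from typing import Sequence


def hit_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the ground-truth item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """With one relevant item, NDCG@k reduces to 1 / log2(rank + 1) inside the top k."""
    for position, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(position + 1)
    return 0.0


# Two toy test cases: the first target is ranked second, the second is missed.
predictions = [["a", "b", "c"], ["x", "y", "z"]]
targets = ["b", "q"]

hr = sum(hit_at_k(p, t, 3) for p, t in zip(predictions, targets)) / len(targets)
ndcg = sum(ndcg_at_k(p, t, 3) for p, t in zip(predictions, targets)) / len(targets)
print(f"HR@3={hr:.3f} NDCG@3={ndcg:.3f}")
```

Hit Rate only asks whether the target made the cut, while NDCG also rewards placing it nearer the top of the list.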