Large Language Model Augmented Narrative Driven Recommendations

📝 Paper Summary

Narrative-Driven Recommendation (NDR), Data Augmentation with LLMs, Dense Retrieval

Mint repurposes historical user-item interaction data by prompting LLMs to generate synthetic narrative queries, creating training data for efficient retrieval models that outperform baselines without human-labeled examples.

Core Problem

Narrative-driven recommendation (NDR) lacks abundant training data because standard datasets only contain user-item interactions (ratings/reviews), not the verbose, context-rich natural language queries users actually write.

Why it matters:

Users increasingly use conversational interfaces to solicit recommendations with complex constraints (e.g., 'quiet place for study with cheap coffee'), which keyword search handles poorly.
Existing recommendation datasets lack the narrative query side of the input/output pair, forcing systems to rely on ineffective zero-shot methods or scarce manual data.
Deploying large LLMs for direct inference is expensive and slow; training smaller, specialized models is preferable but requires data that doesn't exist.

Concrete Example: A traveler posts: 'I'm looking for a dinner spot in Boston that is kid-friendly, has outdoor seating, and isn't too expensive.' Standard collaborative filtering works on user-item ID pairs and misses these semantic constraints. Zero-shot LLMs can answer but are costly. Mint instead synthesizes queries of this kind from users' past reviews of items they liked, then uses them to train a dedicated retriever.

Key Novelty

Mint (Data Augmentation with Interaction Narratives)

Inverts the standard recommendation paradigm: instead of predicting items from a user profile, it prompts an LLM to write a plausible narrative query *given* the items a user liked.
Applies a 'filtering' step using a smaller language model to check the likelihood of the generated query against specific items, removing noise before training.
Distills the knowledge of a massive, expensive LLM (175B) into a small, efficient bi-encoder (110M) via this synthetic dataset generation.
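The inversion described above (items in, narrative query out) can be illustrated with a small prompt builder. This is a minimal sketch: the function name `build_narrative_prompt` and the template wording are hypothetical and are not the prompt used by the authors. Only the idea of conditioning the LLM on a set of one user's reviews comes from this summary.

```python
from typing import List


def build_narrative_prompt(reviews: List[str], max_reviews: int = 10) -> str:
    """Assemble a prompt asking an LLM to write the request a user might post.

    The template text is a hypothetical illustration, not the authors' prompt.
    """
    # Keep at most `max_reviews` reviews (the summary reports 10 per prompt).
    selected = reviews[:max_reviews]
    numbered = "\n".join(
        f"{i}. {text.strip()}" for i, text in enumerate(selected, start=1)
    )
    return (
        "Below are reviews written by one user for places they visited.\n"
        f"{numbered}\n"
        "Write a forum post in which this user asks for recommendations "
        "that match the preferences shown in these reviews.\n"
        "Request:"
    )


# Toy input: 12 placeholder reviews, of which only the first 10 are used.
example_reviews = [f"Review text {i}" for i in range(12)]
prompt = build_narrative_prompt(example_reviews)
```

The returned string would then be sent to the generator LLM, and the completion treated as a synthetic narrative query paired with the reviewed items.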

Architecture

The complete workflow for creating the Mint retrieval system, from data augmentation to model training.

Evaluation Highlights

BiEnc-Mint (110M params) outperforms the unsupervised BM25 baseline by +30% on NDCG@5 (0.3489 vs 0.2682) on the Pointrec dataset.
Small-model BiEnc-Mint achieves statistical parity with a massive 175B Grounded LLM baseline (0.3489 vs 0.3558 NDCG@5) while being orders of magnitude more efficient.
Cross-Encoder Mint outperforms standard bi-encoder baselines (like Contriever) by over 27% on NDCG@5 (0.3725 vs 0.2924).

Breakthrough Assessment

7/10

Clever and practical application of LLMs for data augmentation in a data-scarce domain (NDR). Matches huge model performance with small models. Limited by evaluation on a single dataset.

⚙️ Technical Details

Problem Definition

Setting: Retrieval and Ranking for Narrative-Driven Recommendation

Inputs: A verbose natural language narrative query q describing user preferences and context

Outputs: A ranked list R of items (documents/reviews) from collection C

Pipeline Flow

Narrative Query q
Bi-Encoder Retrieval (Top-200)
Cross-Encoder Re-ranking
Ranked Items
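The retrieve-then-rerank flow above can be sketched end to end with toy scorers. The bag-of-words encoder and the token-overlap scorer below are stand-ins chosen only so the example runs without model weights. They are assumptions for illustration and are not the MPNet-based models described in this summary.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def toy_encode(text: str) -> Vector:
    """Stand-in for a bi-encoder: unit-normalized bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def toy_cross_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query tokens found in the doc."""
    q_tokens = query.lower().split()
    d_tokens = set(doc.lower().split())
    return sum(tok in d_tokens for tok in q_tokens) / max(len(q_tokens), 1)


def recommend(query: str, collection: Dict[str, str],
              first_stage_k: int = 200, final_k: int = 5) -> List[str]:
    q_vec = toy_encode(query)
    # Stage 1: cheap retrieval, keeping the items closest to the query.
    by_distance = sorted(
        collection, key=lambda item: l2_distance(q_vec, toy_encode(collection[item]))
    )
    candidates = by_distance[:first_stage_k]
    # Stage 2: costlier joint scoring, applied only to the candidates.
    reranked = sorted(
        candidates, key=lambda item: toy_cross_score(query, collection[item]),
        reverse=True,
    )
    return reranked[:final_k]


collection = {
    "cafe_a": "quiet cafe with cheap coffee and tables for study",
    "bar_b": "loud sports bar with expensive cocktails",
    "diner_c": "family diner with outdoor seating",
}
ranked = recommend("quiet place for study with cheap coffee", collection,
                   first_stage_k=2, final_k=1)
```

The design point is that the expensive scorer only sees the first-stage candidates (the top 200 in the pipeline above), so its cost does not grow with the size of the collection.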

System Modules

Bi-Encoder (BiEnc-Mint)

First-stage retrieval of candidate items from the full collection

Model or implementation: MPNet-base (110M parameters)

Cross-Encoder (CrEnc-Mint)

Second-stage re-ranking of the top retrieved items

Model or implementation: MPNet-base (110M parameters)

Modeling

Base Model: MPNet-base (110M parameters) for retrieval models; InstructGPT (175B) for data generation

Training Method: Supervised fine-tuning on synthetic data generated by Mint

Objective Functions:

Purpose: Train bi-encoder to pull relevant pairs together.

Formally: Margin Ranking Loss L_Bi = sum(max[L2(q, d) - L2(q, d') + delta, 0])

Purpose: Train cross-encoder to classify relevance.

Formally: Cross-Entropy Loss L_Cr = sum(log(e^s / sum(e^s')))
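Both objectives can be written out for a single query in plain Python. This is a sketch under stated assumptions: embeddings are plain lists of floats, d' ranges over sampled negatives, the softmax denominator includes the positive score, and the cross-entropy is returned as a negative log-softmax so that lower is better. The formula above is written without the leading minus sign, so that sign convention is an assumption.

```python
import math
from typing import List, Sequence


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def margin_ranking_loss(q: Sequence[float], pos: Sequence[float],
                        negs: List[Sequence[float]], delta: float = 1.0) -> float:
    """Sum over negatives d' of max(L2(q, d) - L2(q, d') + delta, 0)."""
    pos_dist = l2(q, pos)
    return sum(max(pos_dist - l2(q, neg) + delta, 0.0) for neg in negs)


def cross_entropy_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Negative log-softmax of the positive score among all candidate scores."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score


q = [0.0, 0.0]
d_pos = [0.3, 0.4]                   # distance 0.5 from q
d_negs = [[3.0, 4.0], [0.6, 0.8]]    # distances 5.0 and 1.0 from q
bi_loss = margin_ranking_loss(q, d_pos, d_negs, delta=1.0)
cr_loss = cross_entropy_loss(0.0, [0.0])
```

With delta = 1, the far negative already satisfies the margin and contributes nothing, while the near negative contributes 0.5 - 1.0 + 1 = 0.5. Two equal scores give a cross-entropy of log 2.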

Training Data:

Source: Yelp User-Item Interactions
Generation: InstructGPT prompts with 10 user reviews -> 1 Synthetic Narrative Query
Filtering: FlanT5 (3B) computes P(q|d), retains top 60% of items per user
Scale: ~10,000 synthetic queries, ~60,000 training pairs
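The filtering step and the resulting dataset size can be checked with a short sketch. The log-likelihood values below are made-up placeholders standing in for log P(q|d) scores that a model such as FlanT5 would produce, and `keep_top_fraction` is a hypothetical helper name. How the 60% cutoff is rounded is also an assumption.

```python
from typing import Dict, List


def keep_top_fraction(log_likelihoods: Dict[str, float],
                      fraction: float = 0.6) -> List[str]:
    """Keep the items with the highest log P(q|d) for one synthetic query."""
    n_keep = max(1, round(fraction * len(log_likelihoods)))
    ranked = sorted(log_likelihoods, key=log_likelihoods.get, reverse=True)
    return ranked[:n_keep]


# Placeholder scores for the 10 reviewed items behind one synthetic query.
scores = {
    "d0": -12.0, "d1": -30.5, "d2": -8.2, "d3": -25.0, "d4": -15.1,
    "d5": -40.0, "d6": -9.9, "d7": -11.3, "d8": -33.3, "d9": -14.0,
}
kept = keep_top_fraction(scores, fraction=0.6)

# Scale check: ~10,000 queries x 6 retained items each.
approx_training_pairs = 10_000 * len(kept)
```

Keeping 6 of 10 items per query across roughly 10,000 queries gives about 60,000 query-item pairs, which matches the scale reported above.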

Key Hyperparameters:

delta (margin): 1
negatives_per_query: 4
reviews_per_prompt: 10
filtering_threshold: Top 60% retained
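One way these hyperparameters could fit together when assembling training examples is sketched below. The summary does not say how negatives are chosen, so uniform random sampling from non-interacted items is an assumption, as is drawing 4 negatives for each (query, positive) pair. `build_triples` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple

DELTA = 1.0                # margin used by the bi-encoder loss
NEGATIVES_PER_QUERY = 4    # value reported in the list above


def build_triples(query: str, positives: Sequence[str], all_items: Sequence[str],
                  negatives_per_pair: int = NEGATIVES_PER_QUERY,
                  seed: int = 0) -> List[Tuple[str, str, str]]:
    """Pair each (query, positive) with randomly sampled negatives."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    positive_set = set(positives)
    # Negatives come from items the user did not interact with (assumption).
    pool = [item for item in all_items if item not in positive_set]
    triples = []
    for pos in positives:
        for neg in rng.sample(pool, negatives_per_pair):
            triples.append((query, pos, neg))
    return triples


all_items = [f"item_{i}" for i in range(20)]
positives = all_items[:6]  # the 6 items retained after filtering
triples = build_triples("synthetic narrative query", positives, all_items)
```

Each triple (q, d, d') is the unit consumed by a margin ranking loss with margin `DELTA`.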

Compute: Generation cost: ~$230 USD (OpenAI API). Training compute: not specified (standard fine-tuning of a BERT-sized encoder).

Comparison to Prior Work

vs. UPR: Mint trains a compact encoder for fast inference, whereas UPR requires expensive generative scoring at test time.
vs. Grounded LLM: Mint matches performance with roughly 1,600x fewer parameters at inference time (110M vs 175B) by distilling knowledge during training.
vs. InPars: Mint generates queries from *sets* of interaction documents (user history) rather than single documents.

Limitations

Synthetic queries may not cover long-tail user interests due to LLM bias towards popular concepts.
Relies on an expensive commercial API (OpenAI's InstructGPT, text-davinci-003) for data generation.
Evaluation is limited to a single dataset (Pointrec) due to lack of other suitable NDR benchmarks.
Synthetic queries often have repetitive structure compared to diverse human writing.

Reproducibility

Code: https://github.com/iesl/narrative-driven-rec-mint/

Code and synthetic datasets are publicly available at the repository above. Data generation relies on OpenAI's text-davinci-003 (InstructGPT), a closed-source dependency; filtering uses the open-source FlanT5.

📊 Experiments & Results

Evaluation Setup

Point-of-Interest (POI) recommendation using narrative queries.

Benchmarks:

Pointrec (Narrative-Driven Recommendation)

Metrics:

NDCG@5
NDCG@10
MAP
Recall@100

Statistical methodology: Two-sided t-tests, with significance reported at p < 0.05.
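NDCG@k, the headline metric in this section, can be sketched as follows. The version below uses linear gain (the relevance grade itself) and computes the ideal ordering from the same list of judged relevances. Both are common conventions but are assumptions here, since the summary does not state which variant the paper uses.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (ranks start at 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances: Sequence[float], k: int) -> float:
    """DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevant items at ranks 1 and 3 of a five-item ranking.
score = ndcg_at_k([1, 0, 1, 0, 0], k=5)
perfect = ndcg_at_k([1, 1, 0, 0, 0], k=5)
```

In the first example the DCG is 1 + 1/2 = 1.5 against an ideal of 1 + 1/log2(3), so the score falls below 1 because the second relevant item sits at rank 3 rather than rank 2.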

Key Results

Performance on Pointrec dataset comparing Mint-trained models against baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Pointrec	NDCG@5	0.2682 (BM25)	0.3489 (BiEnc-Mint)	+0.0807
Pointrec	NDCG@5	0.2924 (Contriever)	0.3725 (CrEnc-Mint)	+0.0801
Pointrec	NDCG@5	0.3558 (Grounded LLM, 175B)	0.3725 (CrEnc-Mint)	+0.0167
Pointrec	NDCG@5	0.3489	0.2949	-0.0540
Pointrec	NDCG@5	0.3489	0.2336	-0.1153
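The absolute deltas in the table and the relative gains quoted under Evaluation Highlights can be reproduced with a few lines of arithmetic. The row labels below are taken from the Evaluation Highlights section; `compare` is a hypothetical helper name.

```python
from typing import Tuple


def compare(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute delta, relative gain in percent) for one table row."""
    delta = ours - baseline
    return round(delta, 4), round(100.0 * delta / baseline, 1)


rows = {
    "BiEnc-Mint vs BM25": (0.2682, 0.3489),
    "CrEnc-Mint vs Contriever": (0.2924, 0.3725),
    "CrEnc-Mint vs Grounded LLM (175B)": (0.3558, 0.3725),
}
results = {name: compare(base, ours) for name, (base, ours) in rows.items()}
```

The first two rows work out to relative gains of about 30.1% and 27.4%, consistent with the "+30%" and "over 27%" figures quoted earlier. The gain over the 175B Grounded LLM is about 4.7%.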

Main Takeaways

Training on synthetic narrative queries generated by large LLMs is highly effective for bootstrapping small retrieval models where no real training data exists.
Filtering synthetic data (removing items with low query likelihood) is crucial; omitting it drops performance significantly.
Larger LLMs (175B) are necessary for generating high-quality complex narrative queries; smaller LLMs (6B) fail to produce effective training data.
Cross-encoder architectures benefit significantly more from the synthetic data than bi-encoders, though both outperform unsupervised baselines.

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering vs. Content-Based Recommendation
Dense Retrieval (Bi-encoders vs. Cross-encoders)
Language Modeling / Prompt Engineering
Query Likelihood Model

Key Terms

NDR: Narrative-Driven Recommendation—recommending items based on long, detailed natural language descriptions of user needs.

Mint: The proposed method: Data augMentation with INteraction narraTives—generating synthetic training queries from user history.

Bi-encoder: A retrieval model that encodes query and document separately into vectors, allowing fast approximate nearest neighbor search.

Cross-encoder: A re-ranking model that processes query and document together in a full attention mechanism, more accurate but slower than bi-encoders.

InstructGPT: A 175B parameter Large Language Model fine-tuned to follow instructions (used here for generating synthetic queries).

FlanT5: A smaller instruction-tuned model used here for filtering synthetic data via query likelihood.

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items.

Query Likelihood: A scoring method estimating how likely a query is to be generated from a document model, used here for denoising synthetic data.