Integrating Summarization and Retrieval for Enhanced Personalization via Large Language Models

📝 Paper Summary

User-profile-based personalization
Modularized RAG pipeline

Augmenting retrieval-based personalization with offline LLM-generated user summaries improves performance and reduces context length usage, especially in sparse data scenarios.

Core Problem

Personalizing LLMs via full user history hits input length limits and latency costs, while standard retrieval suffers from information loss and cold-start issues.

Why it matters:

Full-history prompts can exceed the context window, and very long inputs degrade model performance
Retrieval-only methods miss high-level abstractions of user style and struggle with new users (cold-start)
Real-time voice assistants require low latency, making heavy online processing impractical

Concrete Example: A baseline retrieval model incorrectly guesses a citation preference because it retrieves irrelevant papers. The proposed model uses a summary of the user's research interests ('network architecture', 'wireless security') to correctly identify the relevant citation.

Key Novelty

Summary-Augmented Retrieval for Personalization

Generate task-aware summaries of user history offline using an instruction-tuned LLM (Vicuna or ChatGPT) to capture high-level preferences
At runtime, concatenate this pre-computed summary with a smaller set of retrieved items to form the prompt
Combines the specific detail of retrieval with the broad context of summarization without increasing inference latency
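The offline step described above can be sketched as a batch job that builds one task-aware summary per user and stores it for lookup at runtime. This is a minimal sketch under stated assumptions: `SUMMARY_INSTRUCTION`, `precompute_summaries`, and `stub_summarizer` are hypothetical names, the instruction text is illustrative rather than the paper's actual prompt (those are in its Table 2), and the stub stands in for a call to an instruction-tuned LLM.

```python
from typing import Callable, Dict, List

# Hypothetical instruction template; not the paper's actual prompt wording.
SUMMARY_INSTRUCTION = (
    "Summarize the user's preferences for the task '{task}' "
    "based on these items:\n{items}"
)


def build_summary_prompt(task: str, history: List[str]) -> str:
    """Format a user's history as a bulleted list under the task instruction."""
    items = "\n".join(f"- {h}" for h in history)
    return SUMMARY_INSTRUCTION.format(task=task, items=items)


def precompute_summaries(
    histories: Dict[str, List[str]],
    task: str,
    summarize: Callable[[str], str],
) -> Dict[str, str]:
    """Offline step: produce one summary per user and keep it keyed by user id."""
    return {
        uid: summarize(build_summary_prompt(task, hist))
        for uid, hist in histories.items()
    }


def stub_summarizer(prompt: str) -> str:
    # Stand-in for an LLM call; it only echoes the first history item.
    first_item = prompt.split("\n- ")[1] if "\n- " in prompt else ""
    return f"User is interested in: {first_item}"


histories = {
    "u1": ["wireless security survey", "network architecture notes"],
    "u2": [],  # a user with no history still gets an (empty) summary entry
}
summary_store = precompute_summaries(histories, "citation identification", stub_summarizer)
```

Because `summary_store` is filled ahead of time, the runtime path only performs a dictionary lookup for the summary, which is why the summarizer's cost does not add to inference latency.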

Architecture

The workflow combining offline summarization and runtime retrieval for personalization.

Evaluation Highlights

Summary-augmented method with 75% less retrieved data (k=1 vs k=4) matches or outperforms baselines on 5 out of 6 LaMP tasks
GPT-3.5 summaries with zero retrieval (k=0) outperformed the retrieval baseline (k=4) on the Citation Identification task (+0.029 accuracy, i.e. 2.9 percentage points)
Summarization consistently helps cold-start/sparse scenarios where retrieval fails to find sufficient context

Breakthrough Assessment

7/10

Practical and effective hybrid approach tackling real-world constraints (latency/context window). While methodologically straightforward, the strong performance with reduced retrieval is significant for production systems.

⚙️ Technical Details

Problem Definition

Setting: Personalized text generation/classification given input x and user u

Inputs: Task input text x, user profile history P_u

Outputs: Personalized output y maximizing p(y|x, u)

Pipeline Flow

Summary Generation (Offline)
Retrieval (Runtime)
Prompt Construction
Downstream Generation

System Modules

Summarizer

Generate abstractive summary of user profile offline

Model or implementation: Vicuna-13B or GPT-3.5-turbo

Retriever

Retrieve relevant items from user history based on input query

Model or implementation: BM25
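To make the retriever step concrete, here is a minimal pure-Python sketch of Okapi BM25 ranking over a user's history. The parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and whitespace tokenization are common defaults assumed for illustration; the source does not state which BM25 implementation or settings were used.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Lowercased whitespace split; a real system would likely tokenize more carefully.
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25 (assumed defaults)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / max(n_docs, 1)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def retrieve_top_k(query: str, docs: List[str], k: int) -> List[str]:
    """Return the k highest-scoring history items (k=0 yields an empty list)."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]


history = [
    "wireless security protocols",
    "cooking pasta at home",
    "network architecture design",
]
top = retrieve_top_k("wireless security", history, k=1)
```

Setting `k=0` corresponds to the summary-only configuration, in which the prompt carries no retrieved items at all.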

Generator

Generate final personalized output

Model or implementation: FlanT5-base

Novel Architectural Elements

Hybrid context construction: Concatenating static offline summaries with dynamic runtime retrieval results to balance global user context with local task relevance
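The hybrid context construction can be sketched as a function that places the pre-computed summary first, then adds retrieved items while a token budget allows. This is an illustrative sketch, not the paper's prompt format: the field labels (`User summary:`, `Profile item`, `Input:`), the function names, and the whitespace-based token count are assumptions, and the default budget of 512 mirrors the FlanT5-base context length noted under Limitations.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Whitespace split is a rough stand-in for the generator's real tokenizer.
    return len(text.split())


def build_hybrid_prompt(
    task_input: str,
    summary: str,
    retrieved: List[str],
    max_tokens: int = 512,
) -> str:
    """Concatenate the static summary, as many retrieved items as fit, and the input."""
    input_line = f"Input: {task_input}"
    parts = [f"User summary: {summary}"] if summary else []
    used = count_tokens(input_line) + sum(count_tokens(p) for p in parts)
    for i, item in enumerate(retrieved, start=1):
        line = f"Profile item {i}: {item}"
        if used + count_tokens(line) > max_tokens:
            break  # stop adding retrieved items once the budget is exhausted
        parts.append(line)
        used += count_tokens(line)
    parts.append(input_line)
    return "\n".join(parts)


prompt = build_hybrid_prompt(
    task_input="Which paper is cited?",
    summary="Works on wireless security.",
    retrieved=["Paper A on Wi-Fi attacks"],
)
```

Giving the summary priority over retrieved items is a design choice of this sketch: with a tight budget, the global user context survives and the local details are dropped first.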

Modeling

Base Model: FlanT5-base

Training Method: Supervised Fine-Tuning

Objective Functions:

Purpose: Minimize difference between generated text and reference.

Formally: Standard language modeling loss against output y
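One way to write this loss out explicitly is the usual token-level negative log-likelihood for a sequence-to-sequence model. The notation beyond x, y, and P_u is assumed here for illustration: s_u denotes the offline summary for user u, R_k(x, P_u) the k items retrieved from the profile, φ the prompt-construction function, and θ the generator's parameters.

```latex
\mathcal{L}(\theta)
  = -\sum_{t=1}^{|y|}
    \log p_\theta\!\left(
      y_t \,\middle|\, y_{<t},\;
      \phi\big(x,\, s_u,\, R_k(x, P_u)\big)
    \right)
```

With k = 0 the retrieved set is empty and the prompt contains only the summary and the task input.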

Training Data:

LaMP benchmark datasets (user-based separation)
6 tasks: Citation Identification, News Categorization, Product Rating, News Headline Gen, Scholarly Title Gen, Tweet Paraphrasing

Key Hyperparameters:

learning_rate: 5e-5
weight_decay: 1e-4
warmup_ratio: 0.05
beam_size: 4
epochs: 10 (classification) / 20 (generation)
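The listed hyperparameters could be collected in a small configuration object, for example as below. The class and helper names are hypothetical, and the mapping of LaMP-4, LaMP-5, and LaMP-7 to the 20-epoch setting follows from their being the generation tasks; optimizer, batch size, and scheduler type are not given in this summary and are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Values taken from the hyperparameter list; field names are illustrative."""
    learning_rate: float = 5e-5
    weight_decay: float = 1e-4
    warmup_ratio: float = 0.05
    beam_size: int = 4
    epochs: int = 10


# Generation tasks train for 20 epochs, classification tasks for 10.
GENERATION_TASKS = {"LaMP-4", "LaMP-5", "LaMP-7"}


def config_for(task: str) -> FineTuneConfig:
    return FineTuneConfig(epochs=20 if task in GENERATION_TASKS else 10)


def warmup_steps(cfg: FineTuneConfig, steps_per_epoch: int) -> int:
    """Number of warmup steps implied by warmup_ratio over the whole run."""
    return round(cfg.warmup_ratio * cfg.epochs * steps_per_epoch)


cfg = config_for("LaMP-5")
```

For instance, at a hypothetical 100 optimizer steps per epoch, a classification run of 10 epochs would warm up for 50 steps.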

Compute: FlanT5-base inference takes approximately 125 ms per sample. Summarization (Vicuna/GPT-3.5) is done offline.

Comparison to Prior Work

vs. Retrieval-only: Adds offline summarization component to prompt; reduces retrieval count (k) while maintaining/improving accuracy
vs. Contriever-based RAG [not cited in paper]: Uses BM25 for speed and augments with summary rather than relying on expensive neural retrieval for context

Limitations

Experiments limited to FlanT5-base as downstream model (context length 512)
Relies on simplistic LaMP benchmark tasks which may not reflect complex real-world personalization
Effectiveness depends heavily on the quality of the offline summary (GPT-3.5 > Vicuna)

Reproducibility

No replication artifacts mentioned in the paper (code/scripts not provided). Prompts for summarization are explicitly listed in Table 2. Uses public LaMP benchmark.

📊 Experiments & Results

Evaluation Setup

Personalized classification and generation on LaMP benchmark (User-based split)

Benchmarks:

LaMP-1 (Citation Identification (Binary Choice))
LaMP-2 (News Categorization)
LaMP-3 (Product Rating (Classification))
LaMP-4 (News Headline Generation)
LaMP-5 (Scholarly Title Generation)
LaMP-7 (Tweet Paraphrasing)

Metrics:

Accuracy
F1
MAE
RMSE
ROUGE-1
ROUGE-L

Statistical methodology: Reported means of three repeated runs; significance level p < 0.05
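For reference, the listed metrics can be computed as in the following sketch. These are simplified textbook definitions, not the benchmark's official scoring code: in particular, `rouge1_f1` is a plain unigram-overlap F1 with whitespace tokenization and no stemming, and the three run scores at the end are made-up numbers used only to show the averaging step.

```python
import math
from collections import Counter
from statistics import mean
from typing import Sequence


def accuracy(preds: Sequence, labels: Sequence) -> float:
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


def mae(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Mean absolute error; lower is better."""
    return sum(abs(p - l) for p, l in zip(preds, labels)) / len(labels)


def rmse(preds: Sequence[float], labels: Sequence[float]) -> float:
    """Root mean square error; lower is better."""
    return math.sqrt(sum((p - l) ** 2 for p, l in zip(preds, labels)) / len(labels))


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores from three repeated runs, averaged as in the reporting protocol.
run_scores = [0.74, 0.75, 0.73]
reported = mean(run_scores)
```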

Key Results

Comparison of the GPT-3.5 summary-augmented model (k=1 retrieval) against baseline retrieval (k=4 retrieval), showing efficiency gains:

Benchmark	Metric	Baseline	This Paper	Δ
LaMP-1	Accuracy	0.709	0.743	+0.034
LaMP-2	Accuracy	0.807	0.814	+0.007
LaMP-4	ROUGE-1	0.188	0.181	-0.007

Impact of the summary-only (k=0) approach, demonstrating effectiveness for cold-start/zero-retrieval settings:

Benchmark	Metric	Baseline	This Paper	Δ
LaMP-1	Accuracy	0.709	0.738	+0.029
LaMP-3	MAE (lower is better)	0.311	0.305	-0.006
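The Δ column and the 75% retrieval reduction can be checked with a few lines of arithmetic. The sketch below copies the scores from the table above; the helper names are illustrative, and the only added fact is that MAE and RMSE are error metrics, so a negative Δ on them is an improvement.

```python
# (benchmark, metric, baseline at k=4, summary-augmented score), copied from the table above.
K1_ROWS = [
    ("LaMP-1", "Accuracy", 0.709, 0.743),
    ("LaMP-2", "Accuracy", 0.807, 0.814),
    ("LaMP-4", "ROUGE-1", 0.188, 0.181),
]
K0_ROWS = [
    ("LaMP-1", "Accuracy", 0.709, 0.738),
    ("LaMP-3", "MAE", 0.311, 0.305),
]

LOWER_IS_BETTER = {"MAE", "RMSE"}


def delta(baseline: float, ours: float) -> float:
    """Absolute difference, rounded to the table's three decimals."""
    return round(ours - baseline, 3)


def is_improvement(metric: str, d: float) -> bool:
    return d < 0 if metric in LOWER_IS_BETTER else d > 0


def retrieval_reduction(k_baseline: int, k_new: int) -> float:
    """Fraction of retrieved items saved when moving from k_baseline to k_new."""
    return 1 - k_new / k_baseline


for name, metric, base, ours in K1_ROWS + K0_ROWS:
    d = delta(base, ours)
    print(f"{name} {metric}: {d:+.3f} improved={is_improvement(metric, d)}")

print(f"retrieval reduction k=4 -> k=1: {retrieval_reduction(4, 1):.0%}")
```

Of the five rows shown, only LaMP-4 ROUGE-1 moves in the wrong direction (-0.007); the LaMP-3 MAE change of -0.006 is a gain because lower error is better.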

Main Takeaways

Offline summarization allows reducing retrieval amount by 75% (k=1 vs k=4) while matching or beating performance.
Summaries generated by stronger models (GPT-3.5) significantly outperform those from smaller models (Vicuna-13B), directly impacting downstream accuracy.
The approach is particularly effective for 'cold-start' style problems where specific retrieval might fail but a general user summary provides necessary style/topic guidance.
Hybridizing high-level abstractions (summaries) with low-level details (retrieval) captures user preference better than either method in isolation.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Instruction Tuning for LLMs
Basic concepts of personalization (cold-start problem, user profiles)

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then conditions the language model's output on them

LaMP: Language Model Personalization benchmark—a dataset containing various user-centric NLP tasks like citation prediction and news categorization

BM25: Best Matching 25—a probabilistic information retrieval algorithm used to rank documents based on query terms

cold-start problem: The difficulty of providing personalized recommendations or results for new users who lack sufficient history

ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics used to evaluate automatic summarization and machine translation

MAE: Mean Absolute Error, the average absolute difference between predicted and true values (lower is better)

RMSE: Root Mean Square Error—a standard way to measure the error of a model in predicting quantitative data

instruction-tuned: Language models fine-tuned on datasets of instructions to better follow user commands

offline inference: Generating outputs (like summaries) beforehand and storing them, rather than generating them during the user interaction