Teach LLMs to Personalize--An approach inspired by writing education

📝 Paper Summary

RAG-based personalization, Generative personalization

The paper improves personalized text generation by decomposing it into retrieval, ranking, summarization, and synthesis stages, enhanced by a multi-task objective that teaches the model to identify authorship.

Core Problem

Most personalized generation approaches rely on domain-specific features or short sentence-level tasks, failing to generalize to passage-length generation across different domains using generic LLMs.

Why it matters:

Personalized AI assistants need to generate long-form content (emails, reviews) that matches a specific user's style without relying on rigid, pre-defined user attributes.
Existing methods often struggle with long user histories; simply retrieving recent documents (RecentDoc) or using embeddings (RankDocDense) can miss diverse stylistic nuances or specific content needs.
Zero-shot LLMs often fail to capture deep personal style without explicit finetuning or structured context.

Concrete Example: When generating a book review, a standard retriever might pull generic positive reviews like 'I love this book.' However, the specific user might typically write detailed critiques about character development. A standard model fails to capture this 'why,' generating a generic response instead of one reflecting the user's analytical style.

Key Novelty

Writing-Education Inspired Multi-Stage Framework

Decomposes generation into education-inspired steps: retrieving past writings, ranking them by relevance/evidence, summarizing key topics, and synthesizing important vocabulary before generating.
Introduces 'RankDocBySnpt': A novel retrieval strategy that retrieves short snippets to maximize relevance but ranks the full parent documents to provide broader context.
Correlates reading with writing: Adds an auxiliary 'author distinction' task where the model must determine if two texts were written by the same person, improving its ability to model user style.

Architecture

Overview of the multistage multitask framework. It shows the flow from Immediate Context -> Retrieval -> Ranking -> Summarization/Synthesis -> Generation.

Evaluation Highlights

Outperforms strong baselines (including BM25 and zero-shot PaLM 2) on Avocado emails, Amazon reviews, and Reddit comments across BLEU and ROUGE metrics.
Multi-task learning (AuthorPred) achieves the best performance: +2.08 BLEU on Avocado emails over the BM25 baseline.
RankDocBySnpt retrieval strategy consistently outperforms standard dense and sparse retrieval methods on passage-level generation tasks.

Breakthrough Assessment

7/10

Strong empirical results on passage-level personalization across three diverse domains. The multi-stage pipeline and auxiliary authorship task are intuitive and effective, though the components themselves (T5, standard retrieval) are established technologies.

⚙️ Technical Details

Problem Definition

Setting: Given immediate context x_t and personal context D_t (past documents), generate current document d'_t

Inputs: Immediate context x_t (title + first 150 characters of the current document) and the set of past user documents D_t

Outputs: Generated continuation of the current document, d'_t
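The input/output split above can be illustrated with a minimal sketch. The function name `split_document` and the newline used to join title and document start are assumptions made for illustration; the summary only states that the immediate context is the title plus the first 150 characters.

```python
def split_document(title: str, body: str, n_chars: int = 150) -> tuple[str, str]:
    """Split one document into (immediate_context, target_continuation).

    Assumption: title and document start are joined with a newline;
    the actual input formatting is not specified in the summary.
    """
    start, continuation = body[:n_chars], body[n_chars:]
    immediate_context = f"{title}\n{start}"
    return immediate_context, continuation


if __name__ == "__main__":
    ctx, target = split_document("Trip notes", "x" * 150 + " and the rest of the text")
    print(len(ctx), repr(target))
```

The continuation returned here plays the role of the ground-truth text that the generated output d'_t would be compared against.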

Pipeline Flow

Retrieval (BM25 or Dense)
Ranking (RankDocBySnpt)
Summarization (Context Dependent)
Synthesis (Keyword Extraction)
Generation (Multi-task with Author Distinction)
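One possible way to wire the five stages together is sketched below. The function `run_pipeline` and the stage signatures are hypothetical; in particular, which intermediate outputs are passed to the generator is an assumption, and each stage is a plain callable standing in for the retriever, ranker, and finetuned models.

```python
from typing import Callable, List


def run_pipeline(
    immediate_context: str,
    past_docs: List[str],
    retrieve: Callable[[str, List[str]], List[str]],
    rank: Callable[[str, List[str]], List[str]],
    summarize: Callable[[str, List[str]], str],
    synthesize: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str], str, List[str]], str],
) -> str:
    """Run the stages in order; every stage sees the immediate context."""
    retrieved = retrieve(immediate_context, past_docs)
    ranked = rank(immediate_context, retrieved)
    summary = summarize(immediate_context, ranked)    # context-dependent summary
    keywords = synthesize(immediate_context, ranked)  # important words
    # Assumption: the generator receives ranked entries, summary, and keywords.
    return generate(immediate_context, ranked, summary, keywords)
```

Passing the stages as arguments keeps the sketch independent of any particular retriever or model.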

System Modules

Retriever (Retrieval & Selection)

Retrieve relevant past documents using the immediate context as a query

Model or implementation: GTR-Large (Dense) or BM25 (Sparse)
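For the sparse option, a minimal BM25 scorer is sketched below to show how the immediate context can act as a query over past documents. Whitespace tokenization, the parameters k1 = 1.5 and b = 0.75, and the non-negative IDF variant are common defaults chosen here as assumptions; the paper's exact BM25 configuration is not given in this summary.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(query.lower().split())
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    past = ["quarterly budget review meeting", "lunch plans for friday"]
    print(bm25_scores("budget review", past))
```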

Ranker (Retrieval & Selection)

Re-rank retrieved items to prioritize relevant context

Model or implementation: RankDocBySnpt algorithm

Summarizer (Context Processing)

Generate a summary of ranked entries conditioned on the immediate context

Model or implementation: T5-11B (finetuned)

Synthesizer (Context Processing)

Extract key elements (important words) from retrieved entries

Model or implementation: T5-11B (finetuned)

Generator

Generate the final document text

Model or implementation: T5-11B (Multi-task finetuned)

Novel Architectural Elements

RankDocBySnpt: Hybrid ranking strategy that retrieves at snippet level (for precision) but ranks at document level (for context)
Multi-task Reading/Writing objective: Jointly training the generator to distinguish authorship (reading comprehension) and generate text (writing)
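The idea behind RankDocBySnpt can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: snippets are obtained by splitting on periods, relevance is a simple token-overlap score standing in for BM25 or a dense retriever, and a parent document is scored by its best retrieved snippet. All function names are hypothetical.

```python
def split_snippets(doc: str) -> list[str]:
    """Assumption: a snippet is a period-delimited sentence."""
    return [s.strip() for s in doc.split(".") if s.strip()]


def overlap_score(query: str, text: str) -> float:
    """Stand-in relevance score: fraction of query tokens found in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rank_docs_by_snippet(query: str, docs: list[str], top_k_snippets: int = 5) -> list[str]:
    """Retrieve at snippet level, then return the parent documents in ranked order."""
    scored = [
        (overlap_score(query, snippet), doc_id)
        for doc_id, doc in enumerate(docs)
        for snippet in split_snippets(doc)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best: dict[int, float] = {}
    for score, doc_id in scored[:top_k_snippets]:
        if score > 0:
            # Assumption: a document inherits the score of its best snippet.
            best[doc_id] = max(score, best.get(doc_id, 0.0))
    ranked_ids = sorted(best, key=lambda i: best[i], reverse=True)
    return [docs[i] for i in ranked_ids]
```

Matching on short snippets keeps the relevance signal focused, while returning the full parent document gives the later stages the surrounding context.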

Modeling

Base Model: T5-11B

Training Method: Supervised Finetuning (SFT) and Multi-task Learning

Objective Functions:

Generation objective

Purpose: Minimize the difference between generated text and ground truth.

Formally: Cross-entropy loss on the token sequence

Author distinction objective (auxiliary task)

Purpose: Distinguish whether two documents were written by the same author.

Formally: Binary classification (outputting 'true'/'false' as text)
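The two objectives can be written out as below. The notation is an assumption beyond the symbols x_t and d_t used in the problem definition: c_t denotes the processed personal context (ranked entries, summary, keywords), d_a and d_b are the two documents in an authorship pair, and y is the text label. Because the label is emitted as text, the second loss is also a token-level cross-entropy. How the two losses are combined or mixed during training is not stated in this summary, so no combined formula is given.

```latex
\mathcal{L}_{\text{gen}}
  = -\sum_{i=1}^{|d_t|} \log p_\theta\!\left(d_{t,i} \mid d_{t,<i},\, x_t,\, c_t\right)

\mathcal{L}_{\text{auth}}
  = -\log p_\theta\!\left(y \mid d_a,\, d_b\right),
  \qquad y \in \{\texttt{true},\ \texttt{false}\}
```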

Adaptation: Full fine-tuning

Training Data:

Weak labels created for Summarization/Synthesis by finding similarity between past documents and ground-truth current document
Author Distinction pairs: 50% positive (same user), 50% negative (different user)
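A balanced set of author-distinction pairs could be built along the following lines. The function name `make_author_pairs`, the alternating positive/negative scheme, and uniform random sampling are assumptions; the summary only specifies the 50/50 split and the 'true'/'false' text labels.

```python
import random


def make_author_pairs(
    docs_by_user: dict[str, list[str]], n_pairs: int, seed: int = 0
) -> list[tuple[str, str, str]]:
    """Return (doc_a, doc_b, label) triples, half 'true' and half 'false'."""
    rng = random.Random(seed)
    multi_doc_users = [u for u, docs in docs_by_user.items() if len(docs) >= 2]
    nonempty_users = [u for u, docs in docs_by_user.items() if docs]
    if not multi_doc_users or len(nonempty_users) < 2:
        raise ValueError("need one user with 2+ docs and at least two users with docs")
    pairs = []
    for i in range(n_pairs):
        if i % 2 == 0:
            # Positive pair: two distinct documents by the same user.
            user = rng.choice(multi_doc_users)
            doc_a, doc_b = rng.sample(docs_by_user[user], 2)
            label = "true"
        else:
            # Negative pair: one document each from two different users.
            user_a, user_b = rng.sample(nonempty_users, 2)
            doc_a = rng.choice(docs_by_user[user_a])
            doc_b = rng.choice(docs_by_user[user_b])
            label = "false"
        pairs.append((doc_a, doc_b, label))
    return pairs
```

The string labels match the text-to-text setup, where the model emits 'true' or 'false' rather than a class index.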

Key Hyperparameters:

learning_rate: 0.001
scheduler: linear warmup (1000 steps) + square root decay
optimizer: Adafactor
beam_size: 4
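The learning-rate schedule listed above can be sketched as a small function. The exact functional form is an assumption: the rate rises linearly to the base value over the warmup steps and then decays with the inverse square root of the step, scaled so the curve is continuous at the end of warmup. The training framework's actual formula may differ.

```python
import math


def learning_rate(step: int, base_lr: float = 0.001, warmup_steps: int = 1000) -> float:
    """Linear warmup followed by inverse-square-root decay (assumed form)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * math.sqrt(warmup_steps / step)


if __name__ == "__main__":
    for step in (0, 500, 1000, 4000):
        print(step, learning_rate(step))
```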

Compute: Not reported in the paper

Comparison to Prior Work

vs. LaMP: This paper focuses on passage-level generation rather than sentence-level, and introduces multi-stage processing (summarization/synthesis) rather than direct retrieval-to-generation
vs. RecentDoc: Uses semantic retrieval and ranking (RankDocBySnpt) rather than chronological heuristics
vs. PaLM 2: Finetuned T5-11B significantly outperforms zero-shot PaLM 2, demonstrating the necessity of personalization finetuning
vs. standard Dense Retrieval: Addresses issue of retrieving similar but uninformative snippets by mapping back to parent documents (RankDocBySnpt)

Limitations

Relies on creating weak labels for intermediate stages (summarization/synthesis) since ground truth is unavailable
Synthesis stage is currently limited to keyword extraction rather than higher-level concept synthesis
User ID baseline performs well on short texts (Reddit), suggesting explicit user IDs might still be competitive for specific datasets
Retrieval strategies still struggle with finding diverse yet relevant information (addressed partially by RankDocBySnpt)

Reproducibility

Datasets are public (Avocado, Amazon, Reddit). No code release is mentioned. Weak label generation heuristics are described in detail.

📊 Experiments & Results

Evaluation Setup

Personalized document completion/continuation given title and short start

Benchmarks:

Avocado Research Email Collection (Email generation)
Amazon Review Data (Books) (Review generation)
Reddit Comments (Social media post generation)

Metrics:

Bleu
Rouge-1
Rouge-2
Rouge-L

Statistical methodology: Paired t-test (significance level 0.01)
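A paired t-test compares two systems on the same examples by testing whether the mean per-example score difference is zero. The sketch below computes only the t statistic with the standard library; the function name is hypothetical, and the unit over which scores are paired in the paper is not stated in this summary.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a: list[float], scores_b: list[float]) -> float:
    """t statistic for paired samples: mean(diff) / (stdev(diff) / sqrt(n))."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))
```

The statistic has n - 1 degrees of freedom; turning it into a p-value for comparison against the 0.01 level requires the t distribution, for example via `scipy.stats.ttest_rel`, which performs the whole test.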

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on Avocado Emails showing the progression of adding pipeline components.
Avocado email	Bleu	21.19	23.27	+2.08
Avocado email	Rouge-L	33.07	35.70	+2.63
Performance on Amazon Reviews.
Amazon review	Bleu	19.35	19.78	+0.43
Amazon review	Rouge-1	38.74	39.36	+0.62
Ablation on Reddit dataset.
Reddit	Bleu	29.08	29.13	+0.05
Avocado email	Bleu	23.27	20.74	-2.53

Experiment Figures

Illustration of weak label creation for context-dependent summarization
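Weak label creation for the summarization stage can be sketched as picking the past snippets most similar to the ground-truth current document and using them as the training target. Token-level Jaccard similarity, the `top_k` cutoff, and both function names are assumptions for illustration; the paper's actual similarity heuristic is not detailed in this summary.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (stand-in similarity measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def weak_summary_label(past_snippets: list[str], ground_truth: str, top_k: int = 2) -> list[str]:
    """Return the top_k past snippets most similar to the ground-truth document."""
    ranked = sorted(past_snippets, key=lambda s: jaccard(s, ground_truth), reverse=True)
    return ranked[:top_k]
```

Because the ground truth is only available at training time, labels built this way can supervise the summarizer without any human-written summaries.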

Main Takeaways

Retrieval Augmented Generation consistently outperforms baselines that do not use personal context (ImmedCtx) across all datasets.
Context-dependent summarization/synthesis (conditioned on current document start) outperforms generic summarization, confirming the need to tailor retrieval to the immediate writing task.
The multi-task 'AuthorPred' objective (reading comprehension) improves generation (writing) performance, consistent with the educational observation that reading and writing skills are correlated.
RankDocBySnpt balances the precision of snippet retrieval with the context of full documents, mitigating the problem that dense vectors are less effective for long documents.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
T5 / Sequence-to-Sequence models
BM25 and Dense Retrieval
Multi-task learning

Key Terms

BM25: A ranking function used in information retrieval to estimate the relevance of documents to a given search query based on term frequency

Rouge-L: Evaluation metric measuring the longest common subsequence between reference and generated text, capturing sentence-level structure
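The longest-common-subsequence idea behind Rouge-L can be made concrete with a short sketch. It uses lowercased whitespace tokens and a plain F1 of LCS precision and recall; both choices are assumptions, since official implementations apply their own tokenization and may weight recall differently.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """F1 of LCS-based precision and recall over whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike n-gram overlap, the subsequence does not need to be contiguous, so a candidate that preserves word order but inserts or swaps a few words still scores well.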

Bleu: Bilingual Evaluation Understudy—a metric for evaluating generated text by counting matching n-grams against a reference

T5: Text-to-Text Transfer Transformer—a model architecture where every NLP task is cast as feeding text input to generate text output

Adafactor: A stochastic optimization method based on Adam that reduces memory usage, often used for training Transformers

RankDocBySnpt: A proposed retrieval strategy where snippets are retrieved first to find matches, but the full parent documents containing those snippets are ranked and used for context

IDF: Inverse Document Frequency—a measure of how much information a word provides, based on how common or rare it is across all documents

hard negative: In contrastive learning or classification, a negative sample that is very similar to the positive sample, making it difficult for the model to distinguish