LaMP: When Large Language Models Meet Personalization

📝 Paper Summary

Memory recall | User-profile-based personalization

LaMP introduces a comprehensive benchmark for personalized LLMs and demonstrates that retrieving user-specific history for prompt augmentation significantly improves performance across seven diverse classification and generation tasks.

Core Problem

Existing NLP benchmarks like GLUE enforce a 'one-size-fits-all' evaluation, failing to assess how well Large Language Models adapt to individual user histories and preferences.

Why it matters:

Real-world applications (search, email, recommendations) require tailoring outputs to unique user needs, not just generic correctness
Current LLMs have limited context windows, making it difficult to process large, comprehensive user profiles directly in the prompt
Personalization in LLMs remains understudied due to a lack of diverse, standardized datasets for training and evaluation

Concrete Example: When asking an LLM to generate a news headline, a generic model produces a standard summary. However, a personalized model should mimic the specific stylistic patterns of a journalist based on their past articles, which standard benchmarks do not measure.

Key Novelty

LaMP Benchmark & Retrieval-Augmented Personalization

Constructs a benchmark of 7 diverse tasks (3 classification, 4 text generation; e.g., citation identification, email subject generation) where the correct output depends on a specific user's history
Proposes a retrieval-based personalization framework where relevant items from a user's potentially huge profile are fetched and injected into the LLM context only when needed

Architecture

The retrieval augmentation framework for personalizing LLMs. It illustrates how a user profile is processed to retrieve relevant historical items, which are then used to augment the input prompt for the LLM.

Evaluation Highlights

+23.5% relative average improvement across the benchmark when fine-tuning language models with the proposed personalized augmentation technique
+12.2% relative average improvement in zero-shot settings (e.g., FlanT5-XXL) when using the proposed retrieval augmentation method
Retrieval-augmented personalization consistently outperforms non-personalized baselines across both text classification and generation tasks

Breakthrough Assessment

9/10

Establishes the standard benchmark for personalized LLMs (LaMP), filling a critical gap. The proposed retrieval methods are practical and effective, offering a clear path for future research.

⚙️ Technical Details

Problem Definition

Setting: Personalized language modeling where output is conditioned on a user profile

Inputs: Input sequence x and a user profile P_u containing historical input-output pairs

Outputs: Target output y personalized for user u
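
The setting above can be written compactly. This is a sketch under assumed notation: only x, y, u, and P_u come from the summary, while the indexed pairs (x_{ui}, y_{ui}), the profile size m_u, and the model parameters \theta are introduced here for illustration.

```latex
P_u = \{(x_{u1}, y_{u1}), \dots, (x_{u m_u}, y_{u m_u})\}, \qquad
\hat{y} = \arg\max_{y} \; p_\theta(y \mid x, P_u)
```

Because P_u can be far larger than the context window, the pipeline described next conditions on a retrieved subset of P_u rather than on the whole profile.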

Pipeline Flow

Input Processing: Query Generation
Retrieval & Selection: Retrieve k items from Profile
Prompt Construction: Combine Input + Retrieved Items
Generation: LLM Inference
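
The four stages above can be sketched end to end as follows. This is a minimal illustration, not the paper's implementation: the function names, the prompt template, and the term-overlap scorer (a stand-in for BM25 or Contriever) are all assumptions, and `llm` is any callable that maps a prompt string to an output string.

```python
from typing import Callable, List, Tuple

ProfileItem = Tuple[str, str]  # (past input, past output); hypothetical layout


def generate_query(x: str) -> str:
    # Identity query generation: the input itself is used as the query.
    return x


def retrieve(query: str, profile: List[ProfileItem], k: int) -> List[ProfileItem]:
    # Stand-in retriever: rank profile items by the number of query terms
    # that also appear in the item's past input.
    q_terms = set(query.lower().split())

    def overlap(item: ProfileItem) -> int:
        return len(q_terms & set(item[0].lower().split()))

    return sorted(profile, key=overlap, reverse=True)[:k]


def build_prompt(x: str, items: List[ProfileItem]) -> str:
    # Hypothetical template: retrieved history first, then the current input.
    history = "\n".join(f"Past input: {i} -> Past output: {o}" for i, o in items)
    return f"{history}\nInput: {x}\nOutput:"


def personalize(x: str, profile: List[ProfileItem],
                llm: Callable[[str], str], k: int = 2) -> str:
    query = generate_query(x)            # 1. input processing
    items = retrieve(query, profile, k)  # 2. retrieval and selection
    prompt = build_prompt(x, items)      # 3. prompt construction
    return llm(prompt)                   # 4. generation
```

Passing `llm=lambda prompt: prompt` returns the constructed prompt unchanged, which is a convenient way to inspect what the model would see.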

System Modules

Query Generator

Transform input x into a query q for retrieval

Model or implementation: Identity function (uses input x directly) or task-specific extraction

Retriever

Select k most relevant items from user profile P_u

Model or implementation: BM25 or Contriever
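
To make the sparse option concrete, here is a small from-scratch sketch of Okapi BM25 scoring over a user's profile entries. It assumes whitespace tokenization, the conventional defaults k1 = 1.5 and b = 0.75, and a non-negative IDF variant; none of these settings are taken from the paper, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query: str, docs: List[str], k: int) -> List[str]:
    """Return the k highest-scoring documents (ties keep original order)."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:k]]
```

A dense retriever such as Contriever would replace `bm25_scores` with similarity between learned embeddings, leaving the top-k selection step unchanged.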

Augmenter

Integrate retrieved items into model input

Model or implementation: In-Prompt Augmentation (IPA) or Fusion-in-Decoder (FiD)
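
In-Prompt Augmentation can be pictured as string concatenation under a length budget. The sketch below is illustrative only: the template, the function name `in_prompt_augment`, and the use of whitespace-separated words as a rough proxy for tokens are assumptions rather than the paper's prompts.

```python
from typing import List, Tuple


def in_prompt_augment(x: str, retrieved: List[Tuple[str, str]],
                      max_words: int = 64) -> str:
    """Prepend retrieved (past input, past output) pairs to the input,
    stopping once the word budget is used up.

    `retrieved` is assumed to be sorted from most to least relevant.
    """
    task_part = f"Input: {x}\nOutput:"
    # Reserve room for the task itself and the two-word history header.
    budget = max_words - len(task_part.split()) - 2
    kept = []
    for past_in, past_out in retrieved:
        line = f'"{past_in}" -> "{past_out}"'
        cost = len(line.split())
        if cost > budget:
            break  # the next item would overflow the context budget
        kept.append(line)
        budget -= cost
    if not kept:
        return task_part  # fall back to the non-personalized prompt
    return "User history:\n" + "\n".join(kept) + "\n" + task_part
```

Fusion-in-Decoder avoids this single shared budget by encoding each retrieved item separately, at the cost of requiring a model trained for that input format.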

LLM

Generate personalized output y

Model or implementation: FlanT5-XXL (Zero-shot) or Fine-tuned variants

Novel Architectural Elements

Application of Fusion-in-Decoder (FiD) specifically for user profile integration in personalization tasks
Two-stage retrieval augmentation framework specifically designed for personalized prompt construction from large user histories

Modeling

Base Model: FlanT5-XXL (for zero-shot), FlanT5-Base (for fine-tuning experiments)

Training Method: Fine-tuning with retrieval augmentation

Objective Functions:

Purpose: Minimize prediction error (classification or generation loss).

Formally: Standard cross-entropy loss on target tokens.
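
Written out, the token-level cross-entropy for one training example takes the usual sequence-to-sequence form. The symbols \theta (model parameters) and \tilde{x} (the input x after augmentation with the retrieved profile items) are notation assumed here, not taken from the paper.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid y_{<t}, \tilde{x}\right)
```

For the classification tasks the target y is simply the label rendered as text, so the same objective covers both task types in a text-to-text model.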

Adaptation: Fine-tuning (FlanT5-Base)

Training Data:

LaMP-1 to LaMP-7 datasets
Two split strategies: User-based (new users) and Time-based (future interactions)
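
The difference between the two split strategies can be illustrated with a toy sketch. The record layout `(user_id, timestamp, text)` and both function names are hypothetical, and the benchmark's actual splitting procedure may differ in detail.

```python
from typing import Dict, List, Set, Tuple

Interaction = Tuple[str, int, str]  # (user_id, timestamp, text); assumed layout


def user_based_split(data: List[Interaction], test_users: Set[str]):
    """Held-out users: no test user appears in the training set."""
    train = [r for r in data if r[0] not in test_users]
    test = [r for r in data if r[0] in test_users]
    return train, test


def time_based_split(data: List[Interaction], n_test_per_user: int = 1):
    """Held-out future: each user's most recent interactions go to test."""
    by_user: Dict[str, List[Interaction]] = {}
    for record in data:
        by_user.setdefault(record[0], []).append(record)
    train, test = [], []
    for rows in by_user.values():
        rows = sorted(rows, key=lambda r: r[1])  # oldest first
        cut = max(len(rows) - n_test_per_user, 0)
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test
```

The user-based split probes generalization to unseen users, while the time-based split probes prediction of a known user's later behavior from their earlier history.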

Key Hyperparameters:

retrieval_k: Top-k items from profile (varies by experiment)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Non-personalized: LaMP actively retrieves and uses historical user data
vs. KILT/GLUE: LaMP evaluates personalization specifically, whereas KILT/GLUE are 'one-size-fits-all'
vs. Standard RAG [not cited in paper]: LaMP applies retrieval specifically to *user history* (profile) rather than a general knowledge base

Limitations

Privacy concerns associated with retrieving and processing personal user history are mentioned but not technically solved
Profile size handling is still bound by context length for In-Prompt Augmentation (IPA)
Evaluation relies on static datasets, which may not fully capture dynamic user preference shifts over time

Reproducibility

Code: http://lamp-benchmark.github.io/

Benchmark data, evaluation scripts, and leaderboard are publicly available at http://lamp-benchmark.github.io/. Specific prompts and model weights for the reported experiments are not explicitly linked, but the underlying components (BM25, Contriever, FlanT5) are openly available.

📊 Experiments & Results

Evaluation Setup

Personalized Text Classification and Generation across 7 diverse tasks

Benchmarks:

LaMP-1: Personalized Citation Identification (Binary classification) [New]
LaMP-2: Personalized Movie Tagging (Categorical classification, 15 tags) [New]
LaMP-3: Personalized Product Rating (Ordinal classification, 1-5 stars) [New]
LaMP-4: Personalized News Headline Generation (Text generation) [New]
LaMP-5: Personalized Scholarly Title Generation (Text generation) [New]
LaMP-6: Personalized Email Subject Generation (Text generation) [New]
LaMP-7: Personalized Tweet Paraphrasing (Text generation) [New]

Metrics:

Accuracy
F1
MAE
RMSE
ROUGE-1
ROUGE-L
Statistical methodology: Not explicitly reported in the paper
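
For readers who want the metric definitions in executable form, here is a simplified sketch. It assumes non-empty, equal-length inputs, and the ROUGE-1 function is a bare unigram-overlap F1 with whitespace tokenization and no stemming, so its values will not exactly match a full ROUGE implementation.

```python
import math
from collections import Counter
from typing import List, Sequence


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of predictions that exactly match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute error (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rouge1_f(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # per-word minimum counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

MAE and RMSE suit the ordinal rating task because predicting 4 stars for a 5-star review is a smaller error than predicting 1 star, a distinction accuracy alone cannot express.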

Key Results

General Performance Improvements: Using retrieval augmentation (especially with fine-tuning) consistently improves performance over non-personalized baselines. The baseline value of 0 denotes the non-personalized reference point.
Benchmark	Setting	Metric	Baseline	This Paper	Δ
LaMP Benchmark (Average)	Fine-tuned	Relative Performance Gain (%)	0	23.5	+23.5
LaMP Benchmark (Average)	Zero-shot	Relative Performance Gain (%)	0	12.2	+12.2
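
A relative gain of this kind is typically computed per task against the non-personalized score and then averaged. The sketch below shows that arithmetic under stated assumptions: the function name is hypothetical, the way the paper aggregates across tasks and metrics is not described in this summary, and error metrics such as MAE are assumed to be sign-flipped so that a reduction counts as a gain.

```python
def relative_gain(baseline: float, personalized: float,
                  higher_is_better: bool = True) -> float:
    """Relative improvement over the baseline, in percent.

    For lower-is-better metrics (e.g. MAE, RMSE) a decrease is
    reported as a positive gain.
    """
    if higher_is_better:
        return 100.0 * (personalized - baseline) / baseline
    return 100.0 * (baseline - personalized) / baseline
```

As a purely hypothetical example, an accuracy rising from 0.50 to 0.60 is a gain of about 20%, and an MAE falling from 0.40 to 0.30 is a gain of about 25%.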

Experiment Figures

Examples of the seven tasks in the LaMP benchmark, showing the Input, User Profile context, and Target Output for each.

Main Takeaways

Personalization via retrieval augmentation improves LLM performance across both classification and generation tasks compared to non-personalized baselines.
Both fine-tuned and zero-shot models benefit from accessing user history, with fine-tuning showing larger relative gains.
The benchmark covers a wide variety of domains (academic, movies, e-commerce, news, email, social media), suggesting the findings are robust across different types of text data.
Retrieval strategies (BM25 vs. Contriever) and integration strategies (IPA vs. FiD) provide different trade-offs, but the core concept of utilizing user history is validated.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and prompting
Familiarity with Information Retrieval (BM25, Dense Retrieval)
Basic knowledge of text classification and generation metrics (Accuracy, ROUGE)

Key Terms

User Profile: A collection of a user's historical data, specifically past inputs and personalized outputs they produced or approved

IPA: In-Prompt Augmentation—a method where retrieved user history items are directly prepended to the input text within the LLM's context window

FiD: Fusion-in-Decoder—an architecture where the encoder processes multiple retrieved passages independently, and the decoder aggregates their representations to generate the output

LaMP: Language Model Personalization—the name of the benchmark introduced in this paper

BM25: A ranking function used in information retrieval to estimate the relevance of documents to a given search query based on term frequency, inverse document frequency, and document length normalization

Contriever: A dense retrieval model that encodes queries and documents into vector embeddings to find semantically similar items

ROUGE: Recall-Oriented Understudy for Gisting Evaluation, a set of metrics that score generated text by its overlap with reference text (ROUGE-1 counts shared unigrams; ROUGE-L uses the longest common subsequence)

MAE: Mean Absolute Error, the average absolute difference between predicted and true values (lower is better)

RMSE: Root Mean Square Error, the square root of the average squared difference between predicted and true values; it penalizes large errors more heavily than MAE

Zero-shot: Evaluating a model on a task without providing any specific training examples for that task in the prompt or through fine-tuning