Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A. Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, Nedim Lipka, Hamed Zamani
University of Massachusetts Amherst,
Adobe Research,
University of Oregon
arXiv.org
(2024)
P13N, Benchmark, RAG, Memory
📝 Paper Summary
Personalized Text Generation, Long-form Text Generation, Benchmark Creation
LongLaMP introduces a benchmark for personalized long-form text generation across four domains, evaluating models on their ability to maintain user style and coherence in lengthy outputs like emails and reviews.
Core Problem
Existing personalization research focuses on short text (e.g., email subjects), failing to address the complexities of generating long, coherent, and consistent content that reflects a user's style.
Why it matters:
Real-world applications (emails, reviews, papers) naturally require generating extended passages, not just headlines
Personalizing long text is computationally difficult and prone to topic drift or style inconsistency over long outputs
Existing methods like fine-tuning per user suffer from high storage costs and privacy risks
Concrete Example: In email generation, a model must produce the entire body text in the sender's specific tone and historical writing style, rather than just predicting a subject line. Generic models fail to capture these user-specific linguistic nuances.
Key Novelty
LongLaMP Benchmark
Standardizes evaluation for personalized long-text generation using four diverse tasks: Email, Abstract, Review, and Topic Writing
Introduces two distinct evaluation settings: 'User' (cold start for new users) and 'Temporal' (adapting to known users' evolving style over time)
Proposes a retrieval-augmented generation (RAG) framework that conditions generation on retrieved user history without expensive per-user fine-tuning
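The retrieval-augmented idea above can be illustrated with a minimal sketch. This is not the paper's implementation: the token-overlap scorer is a simple stand-in for whatever retriever the authors use, and the function names, the prompt template, and the choice of k=2 are all assumptions made for illustration. The point is only that the user's history is selected at inference time and placed in the prompt, so no per-user model weights are stored.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, document):
    """Sum how often each distinct query token occurs in the document.

    Hypothetical scorer: a stand-in for a real sparse or dense retriever.
    """
    doc_counts = Counter(tokenize(document))
    return sum(doc_counts[token] for token in set(tokenize(query)))


def retrieve(prompt, profile, k=2):
    """Return the k profile documents that score highest against the prompt."""
    ranked = sorted(profile, key=lambda doc: overlap_score(prompt, doc), reverse=True)
    return ranked[:k]


def build_personalized_prompt(prompt, profile, k=2):
    """Prepend the retrieved user-history documents to the task prompt."""
    history = "\n".join(f"- {doc}" for doc in retrieve(prompt, profile, k))
    return f"Past writing by this user:\n{history}\n\nTask: {prompt}"


# Toy user profile (invented example data, not from any benchmark dataset).
profile = [
    "Quarterly budget review: numbers attached, please send feedback by Friday.",
    "Lunch on Thursday? That new ramen place looks good.",
    "Budget meeting moved to Monday; agenda to follow.",
]
task = "Write an email about the budget meeting"
personalized = build_personalized_prompt(task, profile, k=2)
print(personalized)
```

The resulting string would then be passed to any text generation model; swapping in a stronger retriever only requires replacing `overlap_score`.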
Architecture
Conceptual diagram of the RAG-based personalization framework (derived from text description)
Evaluation Highlights
The proposed RAG framework improves on non-personalized baselines by 5.7% to 128%, depending on the metric (as claimed in the paper's introduction)
Constructed 'Personalized Abstract Generation' task with average context length of ~4560 tokens, significantly longer than typical short-text benchmarks
Established 'Personalized Review Writing' task with ~14,745 training users, providing a large-scale testbed for opinionated long-text generation
Breakthrough Assessment
8/10
Addresses a critical gap (long-form personalization) with a comprehensive, open-source benchmark. Shifts focus from trivial personalization (titles) to complex content generation.
⚙️ Technical Details
Problem Definition
Setting: Generate output y spanning multiple sentences/paragraphs given input prompt x and user profile P_u
Inputs: Task-specific prompt x (e.g., email subject) and User Profile P_u (historical documents)
Outputs: Target output y (e.g., full email body) tailored to user u
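One way to write this setting down compactly, offered as a sketch rather than the paper's own notation: the symbols x, y, and P_u come from the definition above, while the profile size n_u, the retrieval function R, the retrieval depth k, the prompt constructor \varphi, and the model M are assumptions introduced here.

```latex
P_u = \{d_1, d_2, \dots, d_{n_u}\}, \qquad
\hat{y} = M\bigl(\varphi\bigl(x,\; R(x, P_u, k)\bigr)\bigr)
```

Here R selects k documents from the profile that are relevant to x, \varphi merges them with x into a single prompt, and the generated \hat{y} is compared against the user's actual long-form output y.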
Pipeline Flow
User Profile Construction (Aggregate historical docs)
Benchmark is publicly available at http://LongLaMP-benchmark.github.io. The specific model architecture and experimental results tables were not included in the provided text snippet (text ends at Section 3).
📊 Experiments & Results
Evaluation Setup
Personalized generation under 'User' (new user) and 'Temporal' (future prediction) settings
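The two settings can be sketched as two ways of splitting the same collection of user documents. This is an illustrative sketch only: the record fields (`user`, `date`, `text`), the function names, and the rule of holding out exactly one most recent document per user are assumptions, not the benchmark's actual construction procedure.

```python
from collections import defaultdict


def user_split(records, test_users):
    """User setting: every document of a held-out user goes to the test set."""
    train = [r for r in records if r["user"] not in test_users]
    test = [r for r in records if r["user"] in test_users]
    return train, test


def temporal_split(records):
    """Temporal setting: each user's most recent document goes to the test set.

    Earlier documents stay in train and serve as that user's history.
    """
    by_user = defaultdict(list)
    for r in records:
        by_user[r["user"]].append(r)
    train, test = [], []
    for docs in by_user.values():
        docs = sorted(docs, key=lambda r: r["date"])  # ISO dates sort lexicographically
        train.extend(docs[:-1])
        test.append(docs[-1])
    return train, test


# Toy records (invented for illustration).
records = [
    {"user": "a", "date": "2023-01-05", "text": "first"},
    {"user": "a", "date": "2023-06-01", "text": "second"},
    {"user": "b", "date": "2022-11-20", "text": "third"},
    {"user": "b", "date": "2023-02-14", "text": "fourth"},
]
user_train, user_test = user_split(records, {"b"})
temp_train, temp_test = temporal_split(records)
```

In the user split no test user appears in training, which mimics a cold start; in the temporal split every test user has earlier documents available as history.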
Benchmarks:
Personalized Email Completion: long-text generation (Avocado Research Email Collection) [New]
Personalized Review Writing: opinionated text generation (Amazon Reviews) [New]
Personalized Topic Writing: social media post generation (Reddit TL;DR) [New]
Metrics:
ROUGE-1
ROUGE-L
METEOR
Statistical methodology: Not explicitly reported in the paper
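To make the ROUGE-L metric concrete, here is a minimal sketch of an LCS-based F1 score over whitespace tokens. It is a simplification under stated assumptions: standard ROUGE implementations add their own tokenization, optional stemming, and summary-level variants, and the function names below are hypothetical, so scores from this sketch are not directly comparable to reported numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # LCS is "the cat on the mat" (5 of 6 tokens) -> 0.8333
```

Because the score rewards in-order overlap rather than contiguous n-grams, it tolerates insertions and substitutions inside long outputs, which is one reason it is commonly reported for long-form generation.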
Main Takeaways
Personalizing long-text generation is distinct from short-text and requires dedicated benchmarks handling style, coherency, and topic drift.
The 'Temporal' setting is critical for evaluating how models adapt to a user's evolving style and knowledge over time (e.g., researchers writing new abstracts).
The benchmark includes tasks with long contexts (averaging ~4560 tokens for abstract generation), which puts pressure on standard context windows.
The proposed RAG framework is claimed to improve performance by 5.7% to 128% over non-personalized baselines (based on introduction text).
📚 Prerequisite Knowledge
Prerequisites
Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with text generation metrics (ROUGE, METEOR)
Concept of user modeling and personalization
Key Terms
LongLaMP: Long-text Language Model Personalization—the benchmark proposed in this paper
RAG: Retrieval-Augmented Generation—fetching relevant historical data to prompt the model rather than training on it
User Setting: An evaluation setup where test users have no overlap with training users (simulating cold start)
Temporal Setting: An evaluation setup where test items are the most recent posts for known users (simulating evolving style)
ROUGE-L: A metric measuring the longest common subsequence between generated and reference text
METEOR: A metric that accounts for synonyms and stemming, correlating better with human judgment than simple overlap
Cold Start: The scenario where a system must generate content for a new user with no prior observed behavior in the training set