LongLaMP: A Benchmark for Personalized Long-form Text Generation

📝 Paper Summary

Personalized Text Generation, Long-form Text Generation, Benchmark Creation

LongLaMP introduces a benchmark for personalized long-form text generation across four domains, evaluating models on their ability to maintain user style and coherence in lengthy outputs like emails and reviews.

Core Problem

Existing personalization research focuses on short text (e.g., email subjects), failing to address the complexities of generating long, coherent, and consistent content that reflects a user's style.

Why it matters:

Real-world applications (emails, reviews, papers) naturally require generating extended passages, not just headlines
Personalizing long text is computationally difficult and prone to topic drift or style inconsistency over long outputs
Existing methods such as per-user fine-tuning incur high storage costs and privacy risks

Concrete Example: In email generation, a model must produce the entire body text matching the sender's specific tone and historical writing style, rather than just predicting a subject line. Generic models fail to capture these user-specific linguistic nuances.

Key Novelty

LongLaMP Benchmark

Standardizes evaluation for personalized long-text generation using four diverse tasks: Email, Abstract, Review, and Topic Writing
Introduces two distinct evaluation settings: 'User' (cold start for new users) and 'Temporal' (adapting to known users' evolving style over time)
Proposes a retrieval-augmented generation (RAG) framework that conditions generation on retrieved user history without expensive per-user fine-tuning

Architecture

Conceptual diagram of the RAG-based personalization framework (derived from text description)

Evaluation Highlights

The proposed RAG framework improves on non-personalized baselines by 5.7% to 128%, depending on the metric (as claimed in the paper's introduction)
Constructed 'Personalized Abstract Generation' task with average context length of ~4560 tokens, significantly longer than typical short-text benchmarks
Established 'Personalized Review Writing' task with ~14,745 training users, providing a large-scale testbed for opinionated long-text generation

Breakthrough Assessment

8/10

Addresses a critical gap (long-form personalization) with a comprehensive, open-source benchmark. Shifts focus from trivial personalization (titles) to complex content generation.

⚙️ Technical Details

Problem Definition

Setting: Generate output y spanning multiple sentences/paragraphs given input prompt x and user profile P_u

Inputs: Task-specific prompt x (e.g., email subject) and User Profile P_u (historical documents)

Outputs: Target output y (e.g., full email body) tailored to user u
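
The setting can be written compactly in the usual retrieval-augmented notation. Only x, y, and P_u come from the summary above; the retriever R, the cutoff k, the prompt-construction function \varphi, and the generator M are symbols assumed here for illustration, since the source text does not name them.

```latex
\hat{y} = M\bigl(\varphi\bigl(x,\; R(x, P_u, k)\bigr)\bigr),
\qquad
P_u = \{d_1, d_2, \dots, d_{|P_u|}\}
```

Here R(x, P_u, k) returns the k documents from the profile judged most relevant to x, and \hat{y} is compared against the reference output y.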

Pipeline Flow

User Profile Construction (Aggregate historical docs)
Retrieval (Select relevant docs from profile)
Prompt Integration (Combine prompt + retrieved docs)
Generation (LLM produces personalized text)
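
The four steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the retriever and generator are not reported in the provided text, so a simple token-overlap retriever and a placeholder generator stand in for them, and every function name and the prompt template are hypothetical.

```python
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer; a stand-in for a real tokenizer."""
    return text.lower().split()


def overlap_score(query: str, doc: str) -> int:
    """Number of tokens shared by query and document (multiset overlap)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values())


def retrieve(query: str, profile: List[str], k: int) -> List[str]:
    """Step 2: pick the k profile documents that best match the query.

    Assumption: token overlap is used only for illustration; the actual
    retriever is not reported in the source text.
    """
    ranked = sorted(profile, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query: str, retrieved: List[str]) -> str:
    """Step 3: combine retrieved history with the task prompt (hypothetical template)."""
    history = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Past writing by this user:\n{history}\n\nTask: {query}"


def personalize(query: str, profile: List[str],
                generate: Callable[[str], str], k: int = 2) -> str:
    """Steps 2 to 4: retrieve, build the prompt, and call the generator."""
    return generate(build_prompt(query, retrieve(query, profile, k)))


def echo_generator(prompt: str) -> str:
    """Placeholder for an LLM call; simply returns the prompt it was given."""
    return prompt
```

Step 1 (profile construction) is represented by the `profile` list itself, which would hold a user's historical documents. Swapping `echo_generator` for a real model call leaves the rest of the sketch unchanged.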

System Modules

Retriever

Retrieve relevant user data from the user profile P_u

Model or implementation: Not reported in the provided text

Generator (LLM)

Generate the final long-form text conditioned on retrieved history

Model or implementation: Not reported in the provided text

Novel Architectural Elements

Integration of extensive user profiles into RAG specifically for long-form generation (details truncated in source text)

Modeling

Base Model: Not reported in the provided text

Training Data:

Email Completion: 3,286 train users, avg context length ~3191 tokens (Avocado Dataset)
Abstract Generation: 1,369 train users, avg context length ~4560 tokens (Citation Network V14)
Review Writing: 14,745 train users, avg context length ~1822 tokens (Amazon Reviews)
Topic Writing: 11,442 train users, avg context length ~3260 tokens (Reddit TL;DR)

Compute: Not reported in the provided text

Comparison to Prior Work

vs. Fine-tuning: LongLaMP framework uses RAG to avoid per-user training costs while maintaining personalization
vs. Short-text Personalization: Focuses on generating email bodies/reviews rather than subjects/headlines

Limitations

Computational cost of processing large user profiles (P_u) is high without retrieval
Performance degrades with very long contexts if retrieval is not used
Specific limitations of the proposed solution are not visible due to truncated text

Reproducibility

Code: http://LongLaMP-benchmark.github.io

Benchmark is publicly available at http://LongLaMP-benchmark.github.io. The specific model architecture and experimental results tables were not included in the provided text snippet (text ends at Section 3).

📊 Experiments & Results

Evaluation Setup

Personalized generation under 'User' (new user) and 'Temporal' (future prediction) settings
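
A rough sketch of how the two settings differ as data splits. It assumes each document carries a user ID and a timestamp, and that the Temporal setting holds out exactly one most-recent document per user; the record layout, the function names, and the hold-out size are assumptions for illustration, not details from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Doc:
    user_id: str
    timestamp: int  # any sortable time value (hypothetical field)
    text: str


def user_split(docs: List[Doc], test_users: Set[str]) -> Tuple[List[Doc], List[Doc]]:
    """'User' setting: train and test share no users (cold start)."""
    train = [d for d in docs if d.user_id not in test_users]
    test = [d for d in docs if d.user_id in test_users]
    return train, test


def temporal_split(docs: List[Doc]) -> Tuple[List[Doc], List[Doc]]:
    """'Temporal' setting: each known user's most recent document is held out."""
    by_user: Dict[str, List[Doc]] = {}
    for d in docs:
        by_user.setdefault(d.user_id, []).append(d)
    train: List[Doc] = []
    test: List[Doc] = []
    for items in by_user.values():
        ordered = sorted(items, key=lambda d: d.timestamp)
        train.extend(ordered[:-1])  # earlier documents stay available
        test.append(ordered[-1])    # the latest document becomes the target
    return train, test
```

In the first split a model never sees a test user's documents during training; in the second it sees every user, but only their older writing.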

Benchmarks:

Personalized Email Completion: long-text generation, Avocado Research Email Collection [New]
Personalized Abstract Generation: scientific writing, Citation Network Dataset V14 [New]
Personalized Review Writing: opinionated text generation, Amazon Reviews [New]
Personalized Topic Writing: social media post generation, Reddit TL;DR [New]

Metrics:

ROUGE-1
ROUGE-L
METEOR
Statistical methodology: Not explicitly reported in the paper
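
ROUGE-L scores a generated text by the longest common subsequence (LCS) of tokens it shares with the reference. The sketch below is a simplified illustration, assuming whitespace tokenization and a plain F1 combination of LCS precision and recall; standard ROUGE toolkits add options such as stemming and a weighted F-measure, and the paper's exact configuration is not stated in the provided text.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev: List[int] = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the subsequence need not be contiguous, the score rewards text that preserves the reference's word order even when extra words intervene, which matters more as outputs grow long.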

Main Takeaways

Personalizing long-text generation is distinct from short-text personalization and requires dedicated benchmarks that account for style, coherence, and topic drift.
The 'Temporal' setting is critical for evaluating how models adapt to a user's evolving style and knowledge over time (e.g., researchers writing new abstracts).
The benchmark includes tasks with significant context lengths (up to ~4560 tokens for abstracts), pushing the limits of standard context windows.
The proposed RAG framework is claimed to improve performance by 5.7% to 128% over non-personalized baselines (based on introduction text).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with text generation metrics (ROUGE, METEOR)
Concept of user modeling and personalization

Key Terms

LongLaMP: Long-text Language Model Personalization, the benchmark proposed in this paper

RAG: Retrieval-Augmented Generation, which fetches relevant historical data and places it in the model's prompt rather than training on it

User Setting: An evaluation setup where test users have no overlap with training users (simulating cold start)

Temporal Setting: An evaluation setup where test items are the most recent posts for known users (simulating evolving style)

ROUGE-L: A metric measuring the longest common subsequence between generated and reference text

METEOR: A metric that accounts for synonyms and stemming, correlating better with human judgment than simple overlap

Cold Start: The scenario where a system must generate content for a new user with no prior observed behavior in the training set