LaMP-QA: A Benchmark for Personalized Long-form Question Answering

📝 Paper Summary

User-profile-based personalization, RAG-based personalization

LaMP-QA is a benchmark for personalized long-form question answering that evaluates responses based on how well they address specific needs extracted from user-written question narratives, rather than just matching a single reference answer.

Core Problem

Existing personalization benchmarks focus on style mimicking (e.g., email writing) rather than information-seeking, while current evaluation methods rely on single accepted answers that may not reflect the full range of user preferences.

Why it matters:

Personalization is critical for user satisfaction in search and generation, but the 'generation' aspect for information seeking is underexplored due to a lack of resources.
Relying on a single 'accepted' answer for evaluation is flawed because the asker accepted it without ever seeing the full space of possible responses.
Current benchmarks like LaMP and LongLaMP overlook information-seeking tasks where answers must be tailored to specific user intents and backgrounds.

Concrete Example: A user asks a question on a forum about 'Arts & Entertainment'. A standard QA system gives a generic factual answer. However, the user's detailed narrative reveals they specifically care about 'budget-friendly options' and 'accessibility'. A non-personalized system misses these constraints, while a personalized system uses the user's history and narrative to address these specific needs.

Key Novelty

LaMP-QA Benchmark and Aspect-Based Evaluation

Constructs a dataset from StackExchange where the 'user profile' is the user's history of past questions, and the 'current context' is the question plus a detailed narrative.
Proposes a novel evaluation method where an LLM extracts specific 'rubric aspects' (requirements) from the user's question narrative and scores generated answers based on how well they satisfy these specific aspects.

Architecture

Figure (not reproduced here): conceptual flow of the benchmark creation and evaluation methodology.

Evaluation Highlights

Incorporating personalized context (user profiles) leads to up to 39% performance improvement compared to non-personalized baselines.
Using the target user's profile yields up to 62% better performance compared to using a mismatched (random other user's) profile, confirming the data is truly user-specific.
Human annotators rated the quality of the automatically extracted evaluation aspects 4.9 out of 5, validating the proposed evaluation rubric generation.

Breakthrough Assessment

8/10

Significant contribution to personalized QA by moving beyond style transfer to information needs. The aspect-based evaluation using question narratives is a clever solution to the 'single reference answer' problem.

⚙️ Technical Details

Problem Definition

Setting: Personalized Long-form Question Answering where a model must generate an answer based on a question and user history.

Inputs: Current question x_u and User Profile P_u (set of n_u previously asked questions and descriptions by user u).

Outputs: Personalized response y_hat_u.
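The inputs and outputs above can be written down as a small data model. This is a minimal sketch: the class and field names (`ProfileEntry`, `QAInstance`, `description`, and so on) are illustrative assumptions, not identifiers from the paper or the released dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProfileEntry:
    """One past post by user u: a question plus its description (hypothetical names)."""
    question: str
    description: str


@dataclass
class QAInstance:
    """One example for user u; field names are assumptions for illustration."""
    user_id: str
    question: str  # x_u, the current question
    profile: List[ProfileEntry] = field(default_factory=list)  # P_u

    @property
    def n_u(self) -> int:
        """Number of previously asked questions in the profile."""
        return len(self.profile)


# Toy data, invented for this sketch.
example = QAInstance(
    user_id="u1",
    question="Which museums are worth visiting on a weekend trip?",
    profile=[
        ProfileEntry("Cheap ways to see live theatre?", "I am a student on a tight budget."),
        ProfileEntry("Step-free venues for concerts?", "I travel with a wheelchair user."),
    ],
)
print(example.n_u)  # 2
```

The model's job is then a function from a `QAInstance` to a string, the personalized response y_hat_u.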

Pipeline Flow

Dataset Construction: Filter SE-PQA to keep only non-factoid questions
Rubric Extraction: Use LLM to extract requirements from narratives
Profile Construction: Aggregate user's past posts
Evaluation: Generate answer -> Score against extracted rubrics using Judge LLM
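The four steps above can be sketched as one loop. This is an assumption-laden outline, not the authors' code: each LLM-backed stage is passed in as a plain callable, the dictionary keys (`user`, `question`, `narrative`) are hypothetical, and the choice to skip posts that yield no aspects is ours. The toy stand-ins at the bottom use keyword matching only so the sketch runs without a model.

```python
from typing import Callable, Dict, List

# Every stage the summary assigns to an LLM is injected as a callable,
# so the control flow can be run and tested without any model behind it.
IsFactoid = Callable[[str], bool]
ExtractAspects = Callable[[str], List[str]]
Generate = Callable[[str, List[str]], str]
Judge = Callable[[str, str], float]  # (answer, aspect) -> score in [0, 1]


def run_pipeline(posts: List[Dict[str, str]], history: Dict[str, List[str]],
                 is_factoid: IsFactoid, extract_aspects: ExtractAspects,
                 generate: Generate, judge: Judge) -> List[Dict[str, object]]:
    results = []
    for post in posts:
        # 1. Dataset construction: drop factoid questions.
        if is_factoid(post["question"]):
            continue
        # 2. Rubric extraction: turn the narrative into a list of aspects.
        aspects = extract_aspects(post["narrative"])
        if not aspects:  # assumption: nothing to grade against, so skip
            continue
        # 3. Profile construction: the user's past posts.
        profile = history.get(post["user"], [])
        # 4. Evaluation: generate, then score each aspect and average.
        answer = generate(post["question"], profile)
        scores = [judge(answer, aspect) for aspect in aspects]
        results.append({"user": post["user"], "score": sum(scores) / len(scores)})
    return results


# Toy stand-ins for the LLM stages (invented for this sketch).
def toy_is_factoid(q: str) -> bool:
    return q.lower().startswith("when was")

def toy_extract(narrative: str) -> List[str]:
    return [part.strip() for part in narrative.split(";") if part.strip()]

def toy_generate(q: str, profile: List[str]) -> str:
    return "Budget-friendly picks. Past interests: " + "; ".join(profile)

def toy_judge(answer: str, aspect: str) -> float:
    return 1.0 if aspect.lower() in answer.lower() else 0.0

posts = [
    {"user": "u1", "question": "When was the Louvre built?", "narrative": "date"},
    {"user": "u1", "question": "Which museums should I visit?",
     "narrative": "budget; accessibility"},
]
history = {"u1": ["step-free venues"]}
results = run_pipeline(posts, history, toy_is_factoid, toy_extract,
                       toy_generate, toy_judge)
print(results)  # [{'user': 'u1', 'score': 0.5}]
```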

System Modules

Factoid Filter

Removes questions that do not require personalization (pure facts)

Model or implementation: Gemini 1.5 Pro (test/val) / Gemma 2 27B (train)

Rubric Extractor

Extracts specific information needs (aspects) from the question narrative to serve as evaluation criteria

Model or implementation: Gemini 1.5 Pro (test/val) / Gemma 2 27B (train)
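As a rough illustration of what a rubric extractor has to do, the sketch below builds a prompt and parses the reply into a clean list of aspects. The prompt wording and the JSON-array reply format are assumptions made for this example; the paper's actual prompts are in its Appendix and are not reproduced here. The function names are hypothetical.

```python
import json
from typing import List


def build_aspect_prompt(question: str, narrative: str) -> str:
    """Assumed prompt wording; not the prompt used in the paper."""
    return (
        "Read the question and the asker's narrative. List the specific "
        "information needs an answer must address, as a JSON array of "
        "short strings.\n\n"
        f"Question: {question}\nNarrative: {narrative}\nAspects:"
    )


def parse_aspects(raw: str) -> List[str]:
    """Pull a JSON array out of a model reply; return [] if none parses."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    seen, aspects = set(), []
    for item in items:
        if not isinstance(item, str):
            continue
        text = item.strip()
        # Drop empty strings and case-insensitive duplicates, keep order.
        if text and text.lower() not in seen:
            seen.add(text.lower())
            aspects.append(text)
    return aspects


reply = 'Sure! ["budget-friendly options", "accessibility", "Accessibility"]'
print(parse_aspects(reply))  # ['budget-friendly options', 'accessibility']
```

Parsing defensively matters here because every downstream score is computed against this list; a malformed reply should produce an empty rubric rather than a crash.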

Personalized QA Model

Generates the answer using user context

Model or implementation: Various baselines (Gemma 2, Qwen 2.5, GPT-4o)
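One simple way to feed user context to a generator is to retrieve the most relevant past posts and place them in the prompt. The sketch below uses plain token overlap as the retriever, which is a stand-in chosen for brevity and not necessarily what the paper's baselines use; the function names and the prompt layout are likewise assumptions.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, profile: List[str], k: int = 2) -> List[str]:
    """Return up to k profile entries sharing the most tokens with the question."""
    q_tokens = tokenize(question)
    scored = [(len(q_tokens & tokenize(entry)), i) for i, entry in enumerate(profile)]
    # Highest overlap first; ties keep the original profile order.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [profile[i] for overlap, i in ranked[:k] if overlap > 0]


def build_prompt(question: str, retrieved: List[str]) -> str:
    """Assumed prompt layout: past posts first, then the current question."""
    context = "\n".join(f"- {entry}" for entry in retrieved) or "- (none)"
    return (
        "Past questions from this user:\n"
        f"{context}\n\n"
        f"Answer the user's new question with their interests in mind:\n{question}"
    )


# Toy profile, invented for this sketch.
profile = [
    "Cheap museums in Paris for students",
    "How to repot a cactus",
    "Wheelchair access at Paris museums",
]
question = "Which Paris museums are worth visiting?"
top = retrieve(question, profile, k=2)
print(top)
print(build_prompt(question, top))
```

Token overlap is easily fooled by stopwords and misses paraphrases; a sparse or dense retriever would be the natural replacement for `retrieve` while keeping the same interface.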

Aspect Evaluator

Scores the generated response against the extracted rubric aspects

Model or implementation: Qwen 2.5 32B (Instruction Tuned)
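The aggregation step behind an aspect-based score normalized to 0-1 can be sketched as follows. The integer scale from 0 to `max_score` (defaulting to 2) and the averaging over questions are assumptions made for this example; the summary only states that the final score is normalized to the 0-1 range.

```python
from typing import Sequence


def aspect_score(per_aspect: Sequence[int], max_score: int = 2) -> float:
    """Normalize a judge's per-aspect ratings for one answer to [0, 1].

    Assumes each rating is an integer in 0..max_score (an assumed scale).
    """
    if not per_aspect:
        raise ValueError("need at least one aspect rating")
    if any(not 0 <= s <= max_score for s in per_aspect):
        raise ValueError("rating outside the assumed 0..max_score scale")
    return sum(per_aspect) / (max_score * len(per_aspect))


def benchmark_score(per_question: Sequence[Sequence[int]], max_score: int = 2) -> float:
    """Average the normalized per-answer scores over all questions."""
    scores = [aspect_score(ratings, max_score) for ratings in per_question]
    return sum(scores) / len(scores)


# Toy ratings, invented for this sketch: three aspects for one answer.
print(aspect_score([2, 1, 0]))               # 0.5
print(benchmark_score([[2, 1, 0], [2, 2]]))  # 0.75
```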

Novel Architectural Elements

An evaluation pipeline that uses 'question narratives', hidden from the model being evaluated, to generate dynamic, user-specific grading rubrics (aspects) instead of relying on static reference answers.

Comparison to Prior Work

vs. LaMP/LongLaMP: LaMP-QA focuses on information seeking (QA), where accuracy matters, whereas LaMP and LongLaMP focus on style/content generation (e.g., writing emails).
vs. SE-PQA: SE-PQA is for retrieval (ranking posts); LaMP-QA adapts it for generation and introduces the narrative-based rubric evaluation.
vs. General QA Benchmarks (e.g., TruthfulQA [not cited in paper]): LaMP-QA requires using user history to answer, whereas general benchmarks assume a universal truth.

Limitations

Reliance on LLMs for both dataset construction (Gemini and Gemma for filtering and rubric extraction) and scoring (Qwen as judge) may carry the biases of those models into the benchmark.
The 'User Profile' is strictly defined as historical questions asked by the user, which might not capture all dimensions of user personality or preference.
Evaluation is limited to English language content from StackExchange.
The approach assumes the user's question narrative explicitly contains all necessary evaluation criteria.

Reproducibility

Code: https://github.com/LaMP-Benchmark/LaMP-QA

Data: https://hf.co/datasets/alireza7/LaMP-QA. The paper details the prompts used for filtering and aspect extraction in the Appendix.

📊 Experiments & Results

Evaluation Setup

Personalized Long-form QA on subsets of StackExchange data (Arts, Lifestyle, Society).

Benchmarks:

LaMP-QA (Personalized Question Answering) [New]

Metrics:

Aspect-based Score (0-1 normalized)
Win-rate (Pairwise comparison)
Statistical methodology: Cohen's kappa is used to measure inter-annotator agreement on aspect quality.
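For reference, Cohen's kappa compares the observed agreement p_o between two raters with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a from-scratch implementation of that standard formula; the ratings are toy values invented for illustration, not annotations from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1.0 - p_e)


# Toy 1-5 quality ratings from two hypothetical annotators.
ratings_a = [5, 5, 4, 5, 4, 5]
ratings_b = [5, 5, 4, 4, 4, 5]
kappa = cohens_kappa(ratings_a, ratings_b)
print(round(kappa, 3))  # 0.667
```

Here the raters agree on 5 of 6 items (p_o = 0.833) and chance agreement is 0.5, which gives kappa = 0.667.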

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
LaMP-QA Aspect Extraction	Human Rating (1-5)	Not applicable	4.9	Not applicable

Comparison of models using personalized context (User Profile) versus non-personalized settings, showing clear benefits of personalization.

Main Takeaways

Incorporating user history (profiles) significantly improves answer quality compared to non-personalized baselines (claimed up to 39% improvement).
The proposed evaluation metric (checking against narrative-derived aspects) aligns better with human judgment than pairwise comparisons or aspect-free scoring.
Personalization is highly user-specific: swapping in a mismatched profile (another user's history) hurts substantially, with the correct profile reported to score up to 62% higher than a mismatched one.
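The percentages in these takeaways are relative changes, and a relative gain is not the same number as the corresponding relative drop. The sketch below works through the arithmetic with made-up scores (0.50 and 0.81 are illustrative values, not results from the paper): a score that is 62% higher than a baseline corresponds to the baseline being about 38% lower than that score.

```python
def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, e.g. 0.62 for 62% higher."""
    if base <= 0:
        raise ValueError("base must be positive")
    return (new - base) / base


def equivalent_drop(gain: float) -> float:
    """Relative drop when going from the higher score back to the baseline."""
    return gain / (1.0 + gain)


# Illustrative scores only, chosen so the gain comes out at 62%.
mismatched, matched = 0.50, 0.81
gain = relative_gain(matched, mismatched)
drop = equivalent_drop(gain)
print(f"{gain:.0%} higher, {drop:.1%} lower")  # 62% higher, 38.3% lower
```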

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation)
Familiarity with LLM-based evaluation (LLM-as-a-judge)
Basic knowledge of personalization in information retrieval

Key Terms

question narrative: The detailed description accompanying a question post (e.g., on StackExchange) where the user articulates specific constraints, context, and information needs.

rubric aspects: Specific criteria or requirements extracted from the question narrative used to evaluate whether a generated response meets the user's personalized needs.

SE-PQA: StackExchange Personalized Question Answering dataset, originally designed for retrieval, which serves as the source data for LaMP-QA.

LaMP: Language Model Personalization—a prior benchmark focused on personalized content generation (like emails) rather than information seeking.

factoid question: A question with a single, universal factual answer that does not depend on the user's context (filtered out in this benchmark).

RAG: Retrieval-Augmented Generation—using retrieved documents (in this case, user history) to ground generation.