Comparative Personalization for Multi-document Summarization

📝 Paper Summary

Personalized Multi-document Summarization (MDS), User Modeling

ComPSum improves personalized summarization by explicitly comparing a user's past documents with those of other users to identify distinctive preferences, while AuthorMap provides a reference-free evaluation method based on authorship attribution.

Core Problem

Existing personalized text generation methods rely on general user profile summaries or retrieval, failing to capture fine-grained differences in writing style and content focus that distinguish one user from another.

Why it matters:

Users have conflicting preferences (e.g., formal vs. conversational tone, focus on price vs. quality), making generic summaries unsatisfactory.
Evaluating personalization is difficult due to the lack of reference summaries for new inputs; standard metrics like ROUGE cannot measure how well a summary matches a user's specific style.
Datasets for personalized Multi-document Summarization (MDS) are scarce; the news domain in particular lacks data with user labels.

Concrete Example: In product reviews, User A might focus strictly on durability/price with a formal tone, while User B focuses on aesthetics with a casual tone. Standard methods might retrieve User A's past reviews but fail to highlight *how* A differs from B, resulting in a summary that is generically 'review-like' rather than specifically 'User A-like'.

Key Novelty

Comparative Personalization (ComPSum)

Generates a structured user analysis by retrieving a user's profile document and a 'comparative document' (same topic, different author) to explicitly contrast differences.
Uses this comparative analysis to guide an LLM in generating summaries that mimic the specific user's writing style and content focus.
Introduces AuthorMap, an evaluation framework that checks if an LLM can correctly identify the author of a profile given two generated summaries (authorship attribution as a proxy for personalization quality).
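
The AuthorMap idea can be sketched as a pairwise attribution test. This is a minimal sketch, not the paper's protocol: the trial layout (one profile, one summary generated for its author, one generated for a different user), the order randomisation, and the word-overlap `overlap_judge` standing in for an LLM judge are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

# A judge receives (profile_text, summary_1, summary_2) and returns 0 or 1:
# the index of the summary it attributes to the profile's author.
Judge = Callable[[str, str, str], int]


def authormap_accuracy(
    trials: List[Tuple[str, str, str]], judge: Judge, seed: int = 0
) -> float:
    """Percentage of trials where the judge picks the right summary.

    Each trial is (profile, summary_for_author, summary_for_other_user).
    """
    rng = random.Random(seed)
    correct = 0
    for profile, own, other in trials:
        # Shuffle the presentation order so position carries no signal.
        if rng.random() < 0.5:
            correct += int(judge(profile, own, other) == 0)
        else:
            correct += int(judge(profile, other, own) == 1)
    return 100.0 * correct / len(trials)


def overlap_judge(profile: str, s1: str, s2: str) -> int:
    """Toy stand-in for an LLM judge (assumption): prefer the summary
    that shares more distinct words with the profile."""
    words = set(profile.lower().split())
    o1 = len(words & set(s1.lower().split()))
    o2 = len(words & set(s2.lower().split()))
    return 0 if o1 >= o2 else 1
```

In practice the judge would wrap an LLM call, and separate prompts would target writing style and content focus.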

Evaluation Highlights

ComPSum achieves the highest overall scores (averaging personalization, factuality, and relevance) across all tested LLMs (Llama-3.1-8B, Qwen2.5-14B, Llama-3.3-70B).
On the PerMSum news dataset with Llama-3.1-8B, ComPSum outperforms the RAG baseline by 11.8 points in AuthorMap Writing Style accuracy.
AuthorMap aligns well with human judgments, reaching 80% agreement on writing style for news and on content focus for reviews.

Breakthrough Assessment

7/10

Introduces a clever 'comparative' approach to profiling that improves distinctiveness, plus a valuable reference-free evaluation metric and a new dataset. However, it relies heavily on prompting existing LLMs rather than on novel architectural components.

⚙️ Technical Details

Problem Definition

Setting: Personalized Multi-document Summarization (MDS)

Inputs: Document set D (documents on same topic) and User Profile P_u (documents previously authored by user u)

Outputs: Personalized summary S_u capturing user u's preferences in writing style and content focus

Pipeline Flow

Profile Retrieval: Retrieve k documents from user history similar to input
Comparative Retrieval: For each profile doc, retrieve a 'comparative' doc (same topic, different user)
Structured Analysis Generation: LLM compares profile vs. comparative docs to extract style/content preferences
Summary Generation: LLM generates summary using input docs + structured analysis
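
The four steps above can be sketched as one orchestration function. This is only an illustrative skeleton: `compsum_pipeline` and its four callable parameters are hypothetical names, and the retrievers and LLM calls are injected rather than implemented.

```python
from typing import Callable, List, Tuple


def compsum_pipeline(
    input_docs: List[str],
    retrieve_profile: Callable[[List[str]], List[str]],
    retrieve_comparative: Callable[[str], str],
    analyze: Callable[[List[Tuple[str, str]]], str],
    summarize: Callable[[List[str], str], str],
) -> str:
    # Step 1: pick the user's past documents most similar to the input.
    profile_docs = retrieve_profile(input_docs)
    # Step 2: pair each profile doc with a same-topic doc by another user.
    pairs = [(doc, retrieve_comparative(doc)) for doc in profile_docs]
    # Step 3: contrast each pair to get a structured user analysis.
    analysis = analyze(pairs)
    # Step 4: generate the summary from the input docs plus the analysis.
    return summarize(input_docs, analysis)
```

Passing the components in as callables keeps the flow independent of any particular retriever or LLM client.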

System Modules

Profile Retriever (Retrieval & Selection)

Finds user's past documents relevant to the current input topic

Model or implementation: BM25

Comparative Retriever (Retrieval & Selection)

Finds documents on the same topic as profile docs but written by different users

Model or implementation: BM25
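
Both retrievers can share one BM25 index and differ only in the candidate pool. The sketch below is a small pure-Python Okapi BM25 under assumptions not stated in the source: whitespace tokenisation, the common defaults k1 = 1.5 and b = 0.75, a smoothed non-negative IDF, and a toy corpus of (author, text) pairs.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class BM25:
    def __init__(self, corpus: Sequence[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(d)) for d in corpus]
        self.lengths = [sum(c.values()) for c in self.docs]
        self.avgdl = sum(self.lengths) / len(self.docs)
        n = len(self.docs)
        # Document frequency: each document counts a term at most once.
        df = Counter(t for c in self.docs for t in c)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, i: int) -> float:
        tf = self.docs[i]
        norm = self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avgdl)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def top_k(self, query: str, k: int, candidates: Iterable[int]) -> List[int]:
        # Rank only the allowed candidate documents, best score first.
        return sorted(candidates, key=lambda i: self.score(query, i), reverse=True)[:k]


# Toy corpus of (author, text); purely illustrative.
corpus = [
    ("alice", "battery life is long and the price is fair"),
    ("alice", "sturdy tent at a fair price"),
    ("bob", "battery life looks great in this sleek phone"),
    ("bob", "lovely colors on this tent"),
]
index = BM25([text for _, text in corpus])
user = "alice"
own = [i for i, (a, _) in enumerate(corpus) if a == user]
others = [i for i, (a, _) in enumerate(corpus) if a != user]

# Profile retrieval: the user's own docs closest to the input topic.
profile_ids = index.top_k("phone battery life", 1, own)
# Comparative retrieval: another user's doc closest to that profile doc.
comparative_ids = index.top_k(corpus[profile_ids[0]][1], 1, others)
```

Restricting the candidate pool, rather than filtering after ranking, guarantees that the comparative document always comes from a different author.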

Analysis Generator

Generates structured JSON analysis of user style/content by contrasting profile vs. comparative docs

Model or implementation: Llama-3.1-8B-Instruct / Qwen2.5-14B-Instruct / Llama-3.3-70B-Instruct
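
One way the contrastive prompt and its JSON output could be handled is sketched below. The prompt wording, the two-key schema in `REQUIRED_KEYS`, and both function names are hypothetical; the source only states that the analysis is structured JSON covering style and content.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical schema: one field per preference dimension.
REQUIRED_KEYS = ("writing_style", "content_focus")


def build_analysis_prompt(pairs: List[Tuple[str, str]]) -> str:
    """Lay out each (profile doc, comparative doc) pair for contrast."""
    blocks = []
    for i, (profile_doc, comparative_doc) in enumerate(pairs, start=1):
        blocks.append(
            f"Pair {i}\n"
            f"[Target user] {profile_doc}\n"
            f"[Other user, same topic] {comparative_doc}"
        )
    instructions = (
        "Contrast the target user's documents with the other users' documents. "
        "Reply with a JSON object with the keys "
        + ", ".join(REQUIRED_KEYS)
        + " describing what is distinctive about the target user."
    )
    return "\n\n".join(blocks) + "\n\n" + instructions


def parse_analysis(raw: str) -> Dict[str, str]:
    """Extract the JSON object from the model reply and validate its keys."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start : end + 1])
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"analysis is missing keys: {missing}")
    return {k: str(data[k]) for k in REQUIRED_KEYS}
```

Validating the keys before summary generation makes a malformed analysis fail early instead of silently producing an unpersonalized summary.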

Summary Generator

Synthesizes final summary using the input docs and the user analysis

Model or implementation: Same LLM as Analysis Generator

Novel Architectural Elements

Comparative Analysis Module: A dedicated step where the system retrieves 'negative' examples (same topic, different author) to explicitly derive user preferences via contrast, rather than just summarizing the user's history.

Modeling

Base Model: Llama-3.1-8B-Instruct, Qwen2.5-14B-Instruct, Llama-3.3-70B-Instruct (used for both generation and evaluation)

📊 Experiments & Results

Evaluation Setup

Personalized Multi-document Summarization on News and Reviews

Benchmarks:

PerMSum (Personalized MDS) [New]

Metrics:

AuthorMap Accuracy (Writing Style)
AuthorMap Accuracy (Content Focus)
FactScore (Factuality)
G-Eval (Relevance)

Statistical methodology: Paired bootstrap resampling (p < 0.05)
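
A paired bootstrap test over per-example scores could look like the sketch below. The source names only the method and the 0.05 threshold; the one-sided formulation, the 1000 resamples, and the function name `paired_bootstrap_p` are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap_p(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate how often system A fails to beat system B when the test
    set is resampled with replacement. `a` and `b` hold scores for the
    same examples, in the same order."""
    if len(a) != len(b):
        raise ValueError("score lists must be paired")
    rng = random.Random(seed)
    n = len(a)
    not_better = 0
    for _ in range(n_resamples):
        # Draw one set of indices and apply it to both systems (pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(a[i] for i in idx) <= sum(b[i] for i in idx):
            not_better += 1
    return not_better / n_resamples
```

Under this formulation, an improvement would be called significant when the returned value is below 0.05.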

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on PerMSum (News Domain) using Llama-3.1-8B shows ComPSum outperforming baselines in personalization metrics.
PerMSum (News)	AuthorMap (Style)	57.2	69.0	+11.8
PerMSum (News)	AuthorMap (Content)	56.4	67.0	+10.6
Performance on PerMSum (Reviews Domain) using Llama-3.1-8B.
PerMSum (Reviews)	AuthorMap (Style)	58.0	60.4	+2.4
PerMSum (Reviews)	AuthorMap (Content)	62.8	63.2	+0.4
Overall Quality (Average of Style, Content, Factuality, Relevance).
PerMSum (News)	Overall Score	54.8	60.2	+5.4
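
As a quick arithmetic check, the Δ column can be recomputed from the two score columns, and the overall score expressed as a plain mean of the four metrics. The row values are copied from the table above; `delta` and `overall_score` are illustrative helper names, and the unweighted mean is an assumption based on the description of the overall score as an average.

```python
from statistics import mean

# (benchmark, metric, baseline, this paper, reported delta)
ROWS = [
    ("PerMSum (News)", "AuthorMap (Style)", 57.2, 69.0, 11.8),
    ("PerMSum (News)", "AuthorMap (Content)", 56.4, 67.0, 10.6),
    ("PerMSum (Reviews)", "AuthorMap (Style)", 58.0, 60.4, 2.4),
    ("PerMSum (Reviews)", "AuthorMap (Content)", 62.8, 63.2, 0.4),
    ("PerMSum (News)", "Overall Score", 54.8, 60.2, 5.4),
]


def delta(baseline: float, ours: float) -> float:
    # Round to one decimal to match the table's precision.
    return round(ours - baseline, 1)


def overall_score(style: float, content: float, factuality: float, relevance: float) -> float:
    # Unweighted mean of the four metrics (assumed).
    return mean([style, content, factuality, relevance])


# Rows whose reported delta disagrees with the recomputed one.
mismatches = [row for row in ROWS if delta(row[2], row[3]) != row[4]]
```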

Main Takeaways

ComPSum consistently outperforms baselines (RAG, CICL, DPL) on personalization metrics (AuthorMap) while maintaining high factuality and relevance.
The comparative analysis step is crucial; ablations removing comparative documents ('w/o comp. doc.') show consistently lower personalization scores.
AuthorMap is a viable reference-free metric: it agrees with human judgments (80% agreement on news writing style) and distinguishes style changes from content changes in controlled experiments.
The method generalizes across model sizes (8B to 70B) and domains (News and Reviews), unlike some baselines (DPL) optimized only for reviews.

📚 Prerequisite Knowledge

Prerequisites

Multi-document summarization
Retrieval-Augmented Generation (RAG)
Authorship Attribution
LLM prompting strategies

Key Terms

ComPSum: Comparative Personalization for Multi-document Summarization, the proposed framework that profiles users by comparing their writing with that of other users.

AuthorMap: The proposed reference-free evaluation framework that uses authorship attribution (can an LLM guess the author of a profile based on the summary?) to measure personalization.

PerMSum: The proposed dataset constructed from Amazon Reviews and All The News articles, specifically curated for personalized MDS.

Content Focus: The specific aspects a user tends to emphasize (e.g., price vs. quality in reviews).

Writing Style: The manner or tone of the text (e.g., formal vs. conversational).

Authorship Attribution: The task of identifying the author of a text; used here as a proxy metric for personalization quality.