Evaluation Setup
Personalized Multi-document Summarization on News and Reviews
Benchmarks:
- PerMSum (Personalized MDS) [New]
Metrics:
- AuthorMap Accuracy (Writing Style)
- AuthorMap Accuracy (Content Focus)
- FactScore (Factuality)
- G-Eval (Relevance)
Statistical methodology:
- Paired bootstrap resampling (p < 0.05)
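The paired bootstrap test named above can be sketched as follows. The source does not say which variant was used, so this is a minimal illustration under assumptions: the function name `paired_bootstrap`, its parameters, and the choice to report the fraction of resamples in which the new system fails to beat the baseline are all hypothetical, not the paper's implementation.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Approximate a one-sided p-value for 'system B beats system A'.

    scores_a and scores_b hold one score per test example, in the same
    order, so that each index refers to the same example (the pairing).
    Assumption: the p-value is taken as the share of resampled test sets
    on which B's total score is not higher than A's.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs one score per example for each system")
    if not scores_a:
        raise ValueError("need at least one example")

    rng = random.Random(seed)
    n = len(scores_a)
    # Working on per-example differences keeps each pair together.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]

    not_better = 0
    for _ in range(n_resamples):
        # Draw n examples with replacement and sum their differences.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples
```

With per-example 0/1 correctness scores for a baseline and a new system, a returned value below 0.05 would correspond to the p<0.05 threshold quoted above.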
Key Results
Performance on PerMSum (News Domain) using Llama-3.1-8B shows ComPSum outperforming baselines in personalization metrics.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| PerMSum (News) | AuthorMap (Style) | 57.2 | 69.0 | +11.8 |
| PerMSum (News) | AuthorMap (Content) | 56.4 | 67.0 | +10.6 |

Performance on PerMSum (Reviews Domain) using Llama-3.1-8B.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| PerMSum (Reviews) | AuthorMap (Style) | 58.0 | 60.4 | +2.4 |
| PerMSum (Reviews) | AuthorMap (Content) | 62.8 | 63.2 | +0.4 |

Overall Quality (Average of Style, Content, Factuality, Relevance).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| PerMSum (News) | Overall Score | 54.8 | 60.2 | +5.4 |
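As a quick arithmetic check on the Δ column, the sketch below recomputes each gain from the reported scores and averages the AuthorMap gains per domain. The numbers are copied from the results above; the names `ROWS`, `deltas`, and `mean_authormap_gain` are illustrative only.

```python
# Rows copied from the Key Results: (benchmark, metric, baseline, this_paper).
ROWS = [
    ("PerMSum (News)", "AuthorMap (Style)", 57.2, 69.0),
    ("PerMSum (News)", "AuthorMap (Content)", 56.4, 67.0),
    ("PerMSum (Reviews)", "AuthorMap (Style)", 58.0, 60.4),
    ("PerMSum (Reviews)", "AuthorMap (Content)", 62.8, 63.2),
    ("PerMSum (News)", "Overall Score", 54.8, 60.2),
]


def deltas(rows):
    """Map (benchmark, metric) to this_paper - baseline, rounded to 0.1."""
    return {(b, m): round(ours - base, 1) for b, m, base, ours in rows}


def mean_authormap_gain(rows, benchmark):
    """Average the AuthorMap (Style, Content) gains for one benchmark."""
    gains = [
        ours - base
        for b, m, base, ours in rows
        if b == benchmark and m.startswith("AuthorMap")
    ]
    return round(sum(gains) / len(gains), 1)


if __name__ == "__main__":
    for key, gain in deltas(ROWS).items():
        print(key, f"{gain:+.1f}")
    for domain in ("PerMSum (News)", "PerMSum (Reviews)"):
        print(domain, "mean AuthorMap gain:", mean_authormap_gain(ROWS, domain))
```

The recomputed gains match the Δ column, and the per-domain averages make the gap between the News and Reviews improvements easy to see.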
Main Takeaways
- ComPSum consistently outperforms baselines (RAG, CICL, DPL) on personalization metrics (AuthorMap) while maintaining high factuality and relevance.
- The comparative analysis step is crucial; ablations removing comparative documents ('w/o comp. doc.') show consistently lower personalization scores.
- AuthorMap is a viable reference-free metric: it agrees closely with human judgment (80% accuracy on news style) and distinguishes between style and content changes in controlled experiments.
- The method generalizes across model sizes (8B to 70B) and domains (News and Reviews), unlike some baselines, such as DPL, that are optimized only for reviews.