Evaluation Setup
RAG-based personalized generation where user history is retrieved and appended to the prompt.
Benchmarks:
- LongLaMP (Personalized Text Generation (Abstracts, Reviews, Topics))
- ALOE (Personalized Multi-turn Dialogue)
- LaMP (Short-text Personalization)
Metrics:
- ROUGE-L
- METEOR
- LLM-as-a-Judge (1-5 scale for ALOE)
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark |
Metric |
Baseline |
This Paper |
Δ |
| Robustness analysis shows PerCE maintains high performance even when learning rate is increased, unlike Standard CE which collapses. |
| LongLaMP (PAG) |
ROUGE-L |
0.2261 |
0.3728 |
+0.1467
|
| Cross-task transfer experiments demonstrate that PerCE learns generalized personalization capabilities that transfer to unseen tasks better than baselines trained on the target task. |
| LongLaMP (PRW) |
Score (likely ROUGE-L/METEOR) |
0.1898 |
0.2211 |
+0.0313
|
| LongLaMP (PTW) |
Score (likely ROUGE-L/METEOR) |
0.1665 |
0.1955 |
+0.0290
|
Main Takeaways
- PerCE achieves substantial gains (up to 68.04%) on open-ended personalization tasks like Review Writing, confirming that weighting personal tokens is highly effective.
- The method demonstrates superior stability across hyperparameters; while CE collapses at higher learning rates, PerCE remains robust.
- PerCE enables strong cross-task transfer, with out-of-domain PerCE models often outperforming in-domain Standard CE models, suggesting it captures fundamental personalization patterns rather than just dataset statistics.