Evaluation Setup
Multiple-choice QA over long conversation histories (discriminative setting) and log-probability ranking of the candidate responses (generative setting)
Benchmarks:
- PersonaMem (Dynamic User Profiling / Personalized Response Selection) [New]
Metrics:
- Accuracy (%)
- Statistical methodology: Not explicitly reported in the paper
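The generative setting above can be illustrated with a minimal sketch of log-probability ranking followed by accuracy scoring. This is not the paper's implementation: the function names, the toy log-probabilities, and the optional length normalization are assumptions made for illustration, and the per-token log-probabilities are taken as given rather than obtained from any particular model API.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float],
                     length_normalize: bool = False) -> float:
    """Score one candidate response from its per-token log-probabilities.

    Assumption: the score is the sum of token log-probs, optionally divided
    by the token count. The paper does not state which variant it uses.
    """
    total = sum(token_logprobs)
    if length_normalize and len(token_logprobs) > 0:
        return total / len(token_logprobs)
    return total


def rank_options(option_token_logprobs: Sequence[Sequence[float]],
                 length_normalize: bool = False) -> int:
    """Return the index of the highest-scoring candidate response."""
    scores = [sequence_logprob(lp, length_normalize)
              for lp in option_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy_percent(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Accuracy (%) = 100 * (number of correct choices) / (number of questions)."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if len(gold) == 0:
        raise ValueError("need at least one question")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Toy question with three candidate responses (hypothetical log-probs).
toy_question = [[-1.0, -2.0], [-0.5, -0.5], [-3.0]]
predicted_index = rank_options(toy_question)
```

In the discriminative setting the model picks an option directly, so only `accuracy_percent` would apply; the ranking step is specific to the generative setting.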
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Overall accuracy results for long-context models on the PersonaMem benchmark (128k context setting). | | | | |
| PersonaMem | Accuracy | 25.0 | 52.0 | +27.0 |
| PersonaMem | Accuracy | 25.0 | 43.0 | +18.0 |
| Performance breakdown by query type shows models struggle with applying new suggestions compared to recalling facts. | | | | |
| PersonaMem | Accuracy | 65.0 | 40.0 | -25.0 |
| Human validation of the synthetic dataset confirms high quality. | | | | |
| PersonaMem Human Eval | Appropriateness | 0.0 | 97.8 | +97.8 |
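The Δ column in the table above is consistent with the "This Paper" value minus the "Baseline" value, expressed in percentage points. The short check below recomputes it from the table's own numbers; the names `ROWS`, `delta_points`, and `format_delta` are illustrative and not taken from the paper.

```python
# (benchmark, metric, baseline, reported) copied from the Key Results table.
ROWS = [
    ("PersonaMem", "Accuracy", 25.0, 52.0),
    ("PersonaMem", "Accuracy", 25.0, 43.0),
    ("PersonaMem", "Accuracy", 65.0, 40.0),
    ("PersonaMem Human Eval", "Appropriateness", 0.0, 97.8),
]


def delta_points(baseline: float, reported: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(reported - baseline, 1)


def format_delta(delta: float) -> str:
    """Render a delta with an explicit sign, matching the table's style."""
    return f"{delta:+.1f}"


DELTAS = [format_delta(delta_points(b, r)) for _, _, b, r in ROWS]
print(DELTAS)
```

Because the values are percentages, the differences are absolute percentage points rather than relative improvements.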
Main Takeaways
- Current frontier models (GPT-4.5, Gemini-1.5) struggle to track dynamic user profiles, achieving only ~50% accuracy on personalization tasks.
- Models are markedly better at recalling past facts (60-70% accuracy) than at applying that knowledge to suggest new ideas or generalize to new scenarios (30-50% accuracy).
- Retrieval-augmented methods (RAG, Mem0) improve performance on factual recall tasks but are less effective for tasks requiring reasoning about preference evolution.
- Reasoning models (o1, o3-mini) do not show a significant advantage over standard models in this personalization domain.