Evaluation Setup
Binary choice task: Given a persona and a question, the model must choose the better of two responses.
Benchmarks:
- PersonaFeedback (Specific): personalization on user-specific questions [New]
- PersonaFeedback (General): personalization on general questions drawn from ShareGPT [New]
Metrics:
- Accuracy (percentage of the model's choices that match the human ground-truth label)
- Statistical methodology: Not explicitly reported in the paper
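The accuracy metric for the binary choice task can be sketched as below. This is a minimal illustration, not the paper's evaluation code: the `Example` fields, the `"A"`/`"B"` label encoding, and the `accuracy` function name are all assumptions made for this example. Because each item has two candidate responses, a model guessing uniformly at random is expected to score about 50%.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Example:
    """One binary-choice item (field names are hypothetical)."""
    persona: str
    question: str
    response_a: str
    response_b: str
    human_choice: str  # "A" or "B": the response preferred by human annotators


def accuracy(predictions: Sequence[str], examples: Sequence[Example]) -> float:
    """Return the percentage of predictions that match the human choice."""
    if len(predictions) != len(examples):
        raise ValueError("predictions and examples must have the same length")
    if not examples:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == ex.human_choice for p, ex in zip(predictions, examples))
    return 100.0 * correct / len(examples)


# Placeholder items for illustration only; these are not benchmark data.
examples = [
    Example("persona 1", "question 1", "resp a", "resp b", "A"),
    Example("persona 2", "question 2", "resp a", "resp b", "B"),
    Example("persona 3", "question 3", "resp a", "resp b", "A"),
    Example("persona 4", "question 4", "resp a", "resp b", "B"),
]
predictions = ["A", "B", "B", "B"]  # three of four match the human choice
print(accuracy(predictions, examples))  # 75.0
```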
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance of reasoning models vs chat models shows little advantage for reasoning. | | | | |
| PersonaFeedback (Specific Avg) | Accuracy | 77.2 | 77.7 | +0.5 |
| Impact of model scale on personalization performance (Open Source Models). | | | | |
| PersonaFeedback (Specific Avg) | Accuracy | 67.3 | 75.2 | +7.9 |
| Performance of Reward Models on Specific questions. | | | | |
| PersonaFeedback (Specific Easy) | Accuracy | 50.0 | 54.2 | +4.2 |
| Training with personalized preference data improves reward models. | | | | |
| PersonaFeedback (Specific Avg) | Accuracy | 63.2 | 73.1 | +9.9 |
| PersonaFeedback (Specific Hard) | Accuracy | 68.6 | 63.3 | -5.3 |
Main Takeaways
- Reasoning capabilities (e.g., o1, o3) do not automatically translate to better personalization; domain-specific alignment is needed.
- RAG strategies fall short compared to explicit persona profiles, likely due to noise in retrieved memories and the difficulty of implicit inference.
- Current reward models are over-optimized for general helpfulness and fail to capture personalized nuances, sometimes performing near random.
- Personalization scores show little correlation with the standard 'helpfulness' or 'correctness' scores from HelpSteer2, indicating that personalization is a distinct dimension of quality.
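The last takeaway rests on a correlation between per-response scores. The summary does not say which correlation statistic the paper uses, so the sketch below assumes Pearson's r purely for illustration; the `pearson` helper and the score lists passed to it are hypothetical and are not results from the paper. A value near 0 would be consistent with personalization being a separate quality dimension, while a value near 1 would suggest it is largely captured by helpfulness.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder scores (NOT data from the paper), one pair per response.
personalization_scores = [1.0, 2.0, 3.0, 4.0]
helpfulness_scores = [2.0, 1.0, 4.0, 3.0]
r = pearson(personalization_scores, helpfulness_scores)
print(f"Pearson r = {r:.2f}")
```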