Evaluation Setup
Persona Inference (PI) is evaluated by a GPT-4o judge; Persona Tailoring (PT) is evaluated by GPT-4o and by human annotators.
Benchmarks:
- BeaverTails (QA / Advice)
- Stanford Human Preferences (SHP) (Reddit post advice)
- Anthropic HHH (Dialogue)
- Mnemonic (Education/Learning)
Metrics:
- PI Accuracy (GPT-4o judge)
- Persona Quality Win-Rate
- Personalization Score/Win-Rate (PT-DPO vs DPO)
- Judge validation: the paper reports 90% agreement between the GPT-4o judge and human annotators.
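As a rough illustration of how a pairwise win rate and a judge-versus-human agreement rate of the kind listed above could be computed, here is a minimal Python sketch. The function names, the label strings, and the half-credit treatment of ties are assumptions made for illustration; the paper's exact scoring protocol is not described in this summary.

```python
from typing import Sequence


def win_rate(outcomes: Sequence[str]) -> float:
    """Fraction of pairwise comparisons won by the system under test.

    Each outcome is "win", "loss", or "tie" (hypothetical labels).
    Assumption: a tie counts as half a win.
    """
    if not outcomes:
        raise ValueError("need at least one comparison")
    credit = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(credit[o] for o in outcomes) / len(outcomes)


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Share of items where the LLM judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and equally long")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


if __name__ == "__main__":
    # Toy data only; these are not results from the paper.
    print(win_rate(["win", "loss", "tie", "win"]))
    print(agreement_rate(["A", "B", "A", "A"], ["A", "B", "B", "A"]))
```

Under this convention, a win rate of 0.5 in a head-to-head comparison means the two systems are at parity, so values above 0.5 favor the system under test.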
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 4 datasets | Accuracy (GPT-4o) | Not applicable | 0.91 | Not applicable |
| Average across datasets (excluding BeaverTails) | Win Rate Difference | 0.5 | 0.6 | 0.1 |
| Average across datasets | Personalization Score Improvement | Not reported in the paper | Not reported in the paper | 66% |
Main Takeaways
- LLMs (specifically Llama-405B) can accurately infer why users prefer certain responses, even for 'rejected' outputs.
- Personas derived from rejected responses represent valid but uncommon user needs (e.g., 'direct' vs 'meticulous').
- Training on these inferred personas (PT-DPO) significantly boosts personalization capabilities compared to standard alignment, particularly for users with non-majority preferences.
- The method generalizes: models trained on LLM-inferred personas also perform well on real, diverse personas written by humans.
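The takeaways above describe training on personas inferred for both chosen and rejected responses. This summary does not give the data format, so the following Python sketch shows only one plausible construction: each original preference pair yields two persona-conditioned pairs, and under the persona inferred from the rejected response the preference direction is flipped. The `PreferencePair` type, the `with_persona` prompt template, and the flipping rule are all assumptions for illustration, not the paper's confirmed procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    """One DPO-style training example (hypothetical container)."""
    prompt: str
    chosen: str
    rejected: str


def with_persona(prompt: str, persona: str) -> str:
    # Hypothetical prompt template; the real format is not given in the source.
    return f"{prompt}\n\nUser persona: {persona}"


def build_persona_pairs(
    pair: PreferencePair,
    chosen_persona: str,
    rejected_persona: str,
) -> List[PreferencePair]:
    """Expand one preference pair into two persona-conditioned pairs.

    Assumption: for a user matching the persona inferred from the
    originally rejected response, that response becomes the preferred one.
    """
    return [
        PreferencePair(
            prompt=with_persona(pair.prompt, chosen_persona),
            chosen=pair.chosen,
            rejected=pair.rejected,
        ),
        PreferencePair(
            prompt=with_persona(pair.prompt, rejected_persona),
            chosen=pair.rejected,
            rejected=pair.chosen,
        ),
    ]


if __name__ == "__main__":
    # Toy example; the personas echo the 'meticulous' vs 'direct' contrast above.
    base = PreferencePair("How do I fix a flat tire?", "Detailed steps...", "Short answer.")
    for example in build_persona_pairs(base, "meticulous", "direct"):
        print(example)
```

The resulting pairs have the usual prompt/chosen/rejected shape, so in principle they could be passed to any standard DPO training loop without changing the objective.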