Evaluation Setup
Binary preference prediction using LLMs initialized with specific personas
Benchmarks:
- PRISM (Participatory human feedback prediction)
- OpinionQA (Survey question response prediction)
- Empathetic Conversation (EC) (Empathetic response preference)
- Personal Reddit (PR) (Inferring explicit persona attributes from posts)
Metrics:
- Accuracy (Agreement with human ground truth)
- Accuracy on high-certainty samples (Confidence >= 80)
- Statistical methodology: bootstrap resampling (1,000 resamples) for the human agreement analysis
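The metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function names, the list-based inputs, and the 95% percentile interval are all assumptions made for the example. Only the confidence threshold of 80 and the 1,000 bootstrap resamples come from the setup described here.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions that agree with the human ground truth."""
    if not preds:
        return float("nan")
    return sum(p == y for p, y in zip(preds, labels)) / len(preds)


def high_confidence_accuracy(preds, labels, confidences, threshold=80):
    """Accuracy on samples whose stated confidence is >= threshold.

    Returns (accuracy, coverage), where coverage is the share of samples kept.
    """
    kept = [(p, y) for p, y, c in zip(preds, labels, confidences) if c >= threshold]
    coverage = len(kept) / len(preds) if preds else 0.0
    if not kept:
        return float("nan"), coverage
    return sum(p == y for p, y in kept) / len(kept), coverage


def bootstrap_accuracy(preds, labels, n_resamples=1000, seed=0):
    """Mean accuracy and a percentile interval over bootstrap resamples.

    The 95% percentile interval is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n = len(preds)
    scores = []
    for _ in range(n_resamples):
        # Resample indices with replacement, then score the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(preds[i] == labels[i] for i in idx) / n)
    scores.sort()
    mean = sum(scores) / n_resamples
    lower = scores[int(0.025 * n_resamples)]
    upper = scores[int(0.975 * n_resamples) - 1]
    return mean, (lower, upper)


# Toy data: binary preferences ("A" or "B") with verbalized confidence scores.
preds = ["A", "B", "A", "B"]
labels = ["A", "B", "B", "B"]
confidences = [90, 85, 60, 80]

print(accuracy(preds, labels))
print(high_confidence_accuracy(preds, labels, confidences))
print(bootstrap_accuracy(preds, labels))
```

Reporting coverage alongside the filtered accuracy is a design choice of the sketch: it makes visible how many samples the confidence filter discards.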
Key Results
Performance of standard LLM-as-a-Personalized-Judge (without uncertainty filtering) shows moderate to low agreement with human ground truth.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across datasets | Accuracy | 50.0 | 72.5 | +22.5 |
| Empathetic Conversation (EC) | Accuracy | 50.0 | 58.1 | +8.1 |
| Personal Reddit (PR) | Accuracy | 50.0 | 94.6 | +44.6 |

Filtering by verbal uncertainty (Confidence >= 80) significantly improves accuracy for capable models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| OpinionQA | Accuracy (High Confidence) | 62.3 | 79.2 | +16.9 |
| PRISM | Accuracy (High Confidence) | 74.8 | 83.3 | +8.5 |
Main Takeaways
- Standard LLM-as-a-Personalized-Judge is unreliable for genuine personalization tasks due to persona sparsity.
- Verbal uncertainty is a strong indicator of correctness for powerful models (GPT-4, Command R+), but less effective for weaker models (GPT-3.5, Llama-3).
- LLM judges can recognize when they lack sufficient persona information to make a prediction, provided they are queried for uncertainty.
- Third-person human annotators also struggle with personalization (63.3% accuracy on OpinionQA), suggesting that LLMs with uncertainty filtering (79.2%) may be a more accurate and scalable alternative.
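The uncertainty-filtering idea in these takeaways amounts to selective prediction: the judge answers only when its verbalized confidence clears the threshold and abstains otherwise. The sketch below illustrates this under stated assumptions. The reply format (`Answer: A` / `Confidence: 85`) and the helper names `parse_judgment` and `filtered_prediction` are hypothetical and not taken from the paper; only the threshold of 80 comes from the evaluation setup.

```python
import re

CONF_THRESHOLD = 80  # the "Confidence >= 80" cutoff from the evaluation setup


def parse_judgment(reply):
    """Extract (choice, confidence) from a judge reply.

    Assumes a hypothetical reply format such as:
        Answer: B
        Confidence: 85
    Returns None when either field is missing.
    """
    choice = re.search(r"Answer:\s*([AB])\b", reply)
    conf = re.search(r"Confidence:\s*(\d{1,3})", reply)
    if choice is None or conf is None:
        return None
    # Clamp to 100 in case the model reports an out-of-range value.
    return choice.group(1), min(int(conf.group(1)), 100)


def filtered_prediction(reply, threshold=CONF_THRESHOLD):
    """Return the judge's choice, or None (abstain) if confidence is too low."""
    parsed = parse_judgment(reply)
    if parsed is None:
        return None  # malformed replies are treated as abstentions
    choice, confidence = parsed
    return choice if confidence >= threshold else None


print(filtered_prediction("Answer: B\nConfidence: 85"))  # kept
print(filtered_prediction("Answer: A\nConfidence: 40"))  # abstains
```

Treating malformed replies as abstentions is one possible policy; counting them as errors instead would lower the reported high-confidence accuracy.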