Evaluation Setup
Long-context conversational generation and classification
Benchmarks:
- PrefEval (Personalized Response Generation & Selection) [New]
Metrics:
- Preference Following Accuracy (Generation)
- Selection Accuracy (Classification)
- Statistical methodology: LLM-based evaluation validated against human judgments (5% error rate on 200 samples)
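The metrics above reduce to simple ratios. The following is a minimal sketch of how they might be computed, not the paper's evaluation code. The function names `accuracy` and `disagreement_rate` are hypothetical, and the labels are synthetic, built only so that 10 of 200 judgments differ, matching the reported 5% rate.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of responses judged to follow (or correctly select) the preference."""
    if not correct:
        raise ValueError("need at least one judged response")
    return sum(correct) / len(correct)


def disagreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of samples where the LLM judge and the human annotator differ."""
    if len(judge) != len(human):
        raise ValueError("label sequences must have the same length")
    if not judge:
        raise ValueError("need at least one labeled sample")
    mismatches = sum(j != h for j, h in zip(judge, human))
    return mismatches / len(judge)


# Synthetic labels for illustration only (not data from the paper):
# 200 human labels, with the LLM judge flipped on the first 10 samples.
human_labels = [i % 2 == 0 for i in range(200)]
judge_labels = [(not h) if i < 10 else h for i, h in enumerate(human_labels)]

rate = disagreement_rate(judge_labels, human_labels)
print(f"Human-model disagreement: {rate:.1%}")  # prints "Human-model disagreement: 5.0%"
```

The same `accuracy` helper would apply to both the generation metric (was the preference followed?) and the selection metric (was the correct option chosen?), since each is a per-sample binary judgment averaged over the evaluation set.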
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| PrefEval (Generation) | Accuracy (Zero-shot) | 100 | 10 | -90 |
| PrefEval (Evaluation Protocol) | Human-Model Disagreement Rate | 0 | 5 | +5 |
Main Takeaways
- State-of-the-art LLMs generally lack the ability to proactively recall and apply user preferences in zero-shot settings (<10% accuracy at 10 turns).
- Fine-tuning on the PrefEval dataset effectively improves preference following and generalizes well to longer contexts.
- Implicit preferences (revealed through dialogue choices or persona) are significantly harder for models to track than explicit statements.
- Counter-intuitively, multiple or conflicting preferences in the conversation history can improve performance, possibly because they act as a reinforcement that draws the model's attention back to the user's preferences.