
PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization

Meiling Tao, Chenghao Zhu, Dongyi Ding, Tiannan Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
University of Electronic Science and Technology of China, The Chinese University of Hong Kong, Shenzhen, OPPO
arXiv (2025)
P13N, Benchmark, RL, Memory, RAG

📝 Paper Summary

User-profile based personalization, Benchmark datasets, Metrics and evaluation
PersonaFeedback is a benchmark of 8,298 human-annotated cases that evaluates LLMs' ability to generate personalized responses from explicit personas, revealing that reasoning capabilities and RAG often fail to enhance personalization.
Core Problem
Existing benchmarks conflate the ability to infer personas from history with the ability to generate personalized responses, often relying on implicit signals that make it hard to isolate generation quality.
Why it matters:
  • Current general benchmarks (math, code) do not measure social adaptability or user-specific tailoring, which are crucial for user satisfaction
  • Reliance on implicit persona inference assumes history is sufficient, neglecting scenarios where explicit profiles are available or necessary
  • Reward models trained for general helpfulness (e.g., on HelpSteer2) often fail to distinguish personalized nuances, performing worse than random on specific user queries
Concrete Example: When a user from Northeast China asks 'What should I eat to recover after skiing?', a RAG system might retrieve generic fat-loss diets, missing the crucial context of cold weather and regional habits. A personalized model with an explicit profile would suggest high-energy, warming foods suitable for that specific region.
Key Novelty
Decoupled Explicit Persona Evaluation
  • Provides the user persona explicitly alongside the query, separating the task of 'personalization' (adapting the answer) from 'persona inference' (guessing the user)
  • Categorizes difficulty (Easy, Medium, Hard) based on human inter-annotator agreement (Fleiss' Kappa), where 'Hard' cases have subtle differences that even humans struggle to distinguish
  • Uses a pairwise binary choice format to evaluate models, asking them to select the more personalized response among human-curated options
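To make the agreement statistic concrete, here is a minimal Python sketch of Fleiss' Kappa over a batch of pairwise annotations. The function name `fleiss_kappa`, the label values, and the toy data are illustrative assumptions. The summary does not state the thresholds that map agreement to Easy, Medium, or Hard, so none are encoded here.

```python
from collections import Counter
from typing import List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' Kappa for items that were each labelled by the same number of raters.

    ratings[i] holds the labels given to item i, e.g. ["a", "a", "b"] when
    three annotators picked which of two responses is more personalized.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters per item")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item must have the same number of raters")

    categories: List[str] = sorted({label for item in ratings for label in item})
    # counts[i][j] = number of raters who assigned category j to item i
    counts = [[Counter(item)[cat] for cat in categories] for item in ratings]

    # Observed agreement: fraction of agreeing rater pairs, averaged over items
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall share of each category
    total = n_items * n_raters
    shares = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_expected = sum(s * s for s in shares)

    if abs(1.0 - p_expected) < 1e-12:
        # Only one category was ever used; kappa is undefined, so this sketch
        # returns 1.0 as an (assumed) convention for trivially perfect agreement.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy annotations (hypothetical): three annotators, two pairwise cases
toy = [["a", "a", "b"], ["a", "b", "b"]]
print(fleiss_kappa(toy))
```

Lower values indicate cases where annotators split on which response is more personalized, which is the kind of disagreement the 'Hard' tier is described as capturing.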
Evaluation Highlights
  • Long-reasoning models (o3-mini: 77.7%) barely outperform standard chat models (GPT-4.1: 77.2%) on specific personalized tasks, a gap of only 0.5 points, suggesting reasoning is not the bottleneck.
  • Explicit Persona Profile settings consistently outperform RAG settings (an accuracy gap of roughly 15-20%), and RAG often fails to improve over the 'No Persona' baseline.
  • State-of-the-art reward models (e.g., ArmoRM-Llama3-8B) perform near random (54.2%) on 'Easy' specific questions, showing a lack of alignment with personalized preferences.
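Because the task is a two-way choice, the accuracies quoted above can be read against a 50% chance level. The following is a minimal sketch of that scoring, not the benchmark's own harness: the field names (`persona`, `query`, `response_a`, `response_b`, `label`), the toy cases, and the `choose` callable standing in for the model or reward model under test are all hypothetical.

```python
from typing import Callable, Dict, List

Case = Dict[str, str]


def pairwise_accuracy(cases: List[Case], choose: Callable[[Case], str]) -> float:
    """Fraction of cases where the judge picks the human-preferred response.

    `choose` receives one case and returns "a" or "b".
    """
    if not cases:
        raise ValueError("no cases to score")
    correct = sum(1 for case in cases if choose(case) == case["label"])
    return correct / len(cases)


def margin_over_chance(accuracy: float, chance: float = 0.5) -> float:
    """How far an accuracy sits above (or below) random guessing on binary pairs."""
    return accuracy - chance


# Hypothetical toy cases; the real benchmark data is not reproduced here.
toy_cases: List[Case] = [
    {
        "persona": "Lives in Northeast China; skis on weekends",
        "query": "What should I eat to recover after skiing?",
        "response_a": "Warm, high-energy dishes suited to cold weather.",
        "response_b": "A generic low-calorie fat-loss meal plan.",
        "label": "a",
    },
    {
        "persona": "Vegetarian office worker in a warm climate",
        "query": "Suggest a quick lunch.",
        "response_a": "A grilled beef sandwich.",
        "response_b": "A chilled tofu and vegetable bowl.",
        "label": "b",
    },
]


def always_a(case: Case) -> str:
    """Degenerate judge that ignores the persona and always picks response A."""
    return "a"


accuracy = pairwise_accuracy(toy_cases, always_a)
print(accuracy, margin_over_chance(accuracy))
```

Read this way, the 54.2% reported for a reward model on 'Easy' specific questions is about 4 points above guessing.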
Breakthrough Assessment
8/10
A significant contribution: the benchmark decouples persona inference from personalized generation and exposes the failure of RAG and reasoning models at personalization. The extensive human annotation and tiered difficulty make it a robust diagnostic tool.