
PUB: An LLM-Enhanced Personality-Driven User Behaviour Simulator for Recommender System Evaluation

Chenglong Ma, Ziqi Xu, Yongli Ren, Danula Hettiachchi, Jeffrey Chan
Royal Melbourne Institute of Technology
arXiv (2025)
Recommendation Agent P13N Benchmark

📝 Paper Summary

Recommender System Evaluation, User Behavior Simulation, Agentic User Modeling
PUB simulates recommender system users by inferring Big Five personality traits from behavioral logs to generate synthetic interaction data that preserves statistical fidelity to real-world patterns.
Core Problem
Traditional offline evaluation datasets lack granular personality signals, while existing simulators fail to replicate the complexity and diversity of real user behavior due to oversimplified personalization.
Why it matters:
  • Real-world A/B testing is resource-intensive and carries risks of confounding variables
  • Existing offline datasets are often sparse, noisy, and static, failing to capture dynamic decision-making
  • Current LLM simulators prioritize generic patterns over individual trait-specific dynamics, leading to low-fidelity evaluation results
Concrete Example: A standard simulator might model a user from purchase history alone, recommending popular items. It fails to capture that a user high in 'Openness' prefers niche categories, while a user high in 'Conscientiousness' buys at a regular rhythm. PUB captures these nuances to generate more realistic synthetic logs.
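The two behavioural cues in this example could be quantified roughly as follows. This is a minimal sketch, not the paper's psychometric functions: the proxies (category entropy for Openness, regularity of purchase gaps for Conscientiousness), the function names, and the `n_categories` parameter are all illustrative assumptions.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def category_entropy(categories, n_categories):
    """Hypothetical Openness proxy: Shannon entropy of a user's purchased
    categories, normalized by the log of the catalogue's category count.
    Returns a value in [0, 1]; higher means more varied purchases."""
    if n_categories <= 1 or not categories:
        return 0.0
    counts = Counter(categories)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_categories)


def rhythm_regularity(timestamps):
    """Hypothetical Conscientiousness proxy: 1 / (1 + CV), where CV is the
    coefficient of variation of the gaps between consecutive purchases.
    Returns 1.0 for perfectly even gaps and approaches 0 as gaps get erratic."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return 0.0  # not enough history to judge rhythm
    cv = pstdev(gaps) / mean(gaps)
    return 1.0 / (1.0 + cv)


# A user who buys weekly scores higher than one who buys in bursts.
weekly = rhythm_regularity([0, 7, 14, 21])
bursty = rhythm_regularity([0, 1, 20, 21])
```

Scores like these would still need to be calibrated against a validated personality instrument before being read as trait estimates.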
Key Novelty
Psychometric-to-Behavioural Mapping
  • Infers Big Five personality traits (Openness, Conscientiousness, etc.) directly from digital footprints (e.g., purchase rhythm, review sentiment) using psychometric functions
  • Conditions an LLM agent on these specific inferred traits to generate synthetic interactions, ensuring the agent acts with psychological consistency rather than just generic role-playing
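Conditioning an agent on inferred traits might look like the sketch below. It assumes trait scores are already scaled to [0, 1]; the `BigFive` container, the `level` thresholds, and the prompt wording are hypothetical and are not taken from the paper. The block only builds the prompt text and deliberately leaves out any specific LLM API call.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BigFive:
    """Inferred trait scores, assumed to lie in [0, 1]."""
    openness: float
    conscientiousness: float
    extraversion: float
    agreeableness: float
    neuroticism: float


def level(score):
    """Map a score to a coarse label (thresholds are illustrative)."""
    if score >= 0.67:
        return "high"
    if score <= 0.33:
        return "low"
    return "moderate"


def build_persona_prompt(traits, recent_items):
    """Compose a system prompt that pins the agent to the inferred traits."""
    trait_lines = [
        f"- {level(score)} {name}" for name, score in asdict(traits).items()
    ]
    history = ", ".join(recent_items) if recent_items else "(no history)"
    return (
        "You are simulating one recommender-system user.\n"
        "Personality profile (Big Five):\n"
        + "\n".join(trait_lines)
        + f"\nRecent interactions: {history}\n"
        "Choose the next interaction in a way that is consistent with this profile."
    )


prompt = build_persona_prompt(
    BigFive(0.9, 0.2, 0.5, 0.5, 0.4),
    ["indie board game", "field recorder"],
)
```

Passing coarse labels rather than raw scores is one design choice among several; it keeps the prompt short, at the cost of discarding fine-grained differences between users.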
Architecture
Figure 1. The four-phase architecture of PUB: Profile Aggregator, Metadata Enhancer, Personality Inference, and Simulator.
Evaluation Highlights
  • Achieves 0.31 average Jaccard similarity between synthetic and real user behavior sequences, outperforming baseline simulators
  • Replicates performance trends of sequential recommenders (e.g., SASRec, GRU4Rec) where synthetic test performance mirrors real-world test performance
  • Demonstrates that interaction frequency correlates with simulation quality: Jaccard similarity increases for user groups with richer interaction histories
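For readers unfamiliar with the metric, the Jaccard comparison and the per-group breakdown can be sketched as below. This assumes a set-based Jaccard over interacted item IDs, which ignores order and repeats; the paper's exact protocol may differ, and the function names and the `min_history` cutoff are illustrative.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the distinct items in two sequences."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty histories are treated as identical
    return len(sa & sb) / len(sa | sb)


def mean_jaccard(real, synthetic, users=None):
    """Average per-user Jaccard over users present in both logs.
    `real` and `synthetic` map user ID -> list of item IDs."""
    shared = real.keys() & synthetic.keys()
    if users is not None:
        shared &= set(users)
    if not shared:
        return 0.0
    return sum(jaccard(real[u], synthetic[u]) for u in shared) / len(shared)


def split_by_history(real, min_history):
    """Partition users into (sparse, rich) groups by real history length."""
    sparse = {u for u, items in real.items() if len(items) < min_history}
    rich = set(real) - sparse
    return sparse, rich


real = {"u1": ["a", "b", "c"], "u2": ["x"]}
synthetic = {"u1": ["b", "c", "d"], "u2": ["y"]}
sparse, rich = split_by_history(real, min_history=2)
overall = mean_jaccard(real, synthetic)            # (0.5 + 0.0) / 2
rich_only = mean_jaccard(real, synthetic, rich)    # u1 only
```

Comparing `mean_jaccard` across the groups returned by `split_by_history` is one way to check the reported trend that users with richer histories are simulated more faithfully.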
Breakthrough Assessment
7/10
Novel integration of psychometric theory with LLM-based simulation for RS evaluation. While the fidelity results are promising, the discrepancy in collaborative filtering performance suggests limitations in modeling social signals.