PUB: An LLM-Enhanced Personality-Driven User Behaviour Simulator for Recommender System Evaluation

📝 Paper Summary

Recommender System Evaluation
User Behavior Simulation
Agentic User Modeling

PUB simulates recommender system users by inferring Big Five personality traits from behavioral logs to generate synthetic interaction data that preserves statistical fidelity to real-world patterns.

Core Problem

Traditional offline evaluation datasets lack granular personality signals, while existing simulators fail to replicate the complexity and diversity of real user behavior due to oversimplified personalization.

Why it matters:

Real-world A/B testing is resource-intensive and carries risks of confounding variables
Existing offline datasets are often sparse, noisy, and static, failing to capture dynamic decision-making
Current LLM simulators prioritize generic patterns over individual trait-specific dynamics, leading to low-fidelity evaluation results

Concrete Example: A standard simulator might model a user solely from purchase history and default to popular items. It then misses that a user high in 'Openness' prefers niche categories, while a user high in 'Conscientiousness' buys at a regular rhythm. PUB captures these nuances to generate more realistic synthetic logs.

Key Novelty

Psychometric-to-Behavioural Mapping

Infers Big Five personality traits (Openness, Conscientiousness, etc.) directly from digital footprints (e.g., purchase rhythm, review sentiment) using psychometric functions
Conditions an LLM agent on these specific inferred traits to generate synthetic interactions, ensuring the agent acts with psychological consistency rather than just generic role-playing

Architecture

The four-phase architecture of PUB: Profile Aggregator, Metadata Enhancer, Personality Inference, and Simulator.

Evaluation Highlights

Achieves 0.31 average Jaccard similarity between synthetic and real user behavior sequences, versus 0.15 for the baseline
Replicates performance trends of sequential recommenders (e.g., SASRec, GRU4Rec) where synthetic test performance mirrors real-world test performance
Demonstrates that interaction frequency correlates with simulation quality: Jaccard similarity increases for user groups with richer interaction histories

Breakthrough Assessment

7/10

Novel integration of psychometric theory with LLM-based simulation for RS evaluation. Fidelity results are promising (0.31 average Jaccard similarity), but the weaker performance of collaborative filtering models on synthetic data suggests limitations in modeling social/collaborative signals.

⚙️ Technical Details

Problem Definition

Setting: Simulating user-item interaction sequences for recommender system evaluation using synthetic agents

Inputs: User behavioral logs (ratings, timestamps, item categories) and item metadata

Outputs: Synthetic interaction sequences (S_u) mirroring real user preferences and personality

Pipeline Flow

User Profile Aggregator (Extracts stats)
Metadata Enhancer (Contextualizes items)
Personality Inference Module (Maps stats to traits)
User Behaviour Simulator (Generates actions)

System Modules

User Profile Aggregator (Data Processing)

Extract statistical features from raw logs to create a behavioral signature

Model or implementation: Statistical functions (Circular statistics, Entropy)
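
As an illustration of the kind of statistics named above, the following sketch computes Shannon entropy over a user's item categories and circular statistics over purchase hours. The function names, and the choice of hour-of-day as the circular variable, are assumptions made for illustration; the summary does not specify which features PUB actually computes.

```python
import math
from collections import Counter

def category_entropy(categories):
    """Shannon entropy (in bits) of a user's item-category distribution."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def circular_hour_stats(hours):
    """Circular mean hour and mean resultant length for hours of day in [0, 24).

    The resultant length is near 1 when purchases cluster at one time of day
    and near 0 when they are spread evenly around the clock.
    """
    if not hours:
        raise ValueError("hours must be non-empty")
    # Map each hour onto the unit circle so that 23:00 and 01:00 are neighbours.
    angles = [2 * math.pi * h / 24 for h in hours]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    mean_angle = math.atan2(s, c) % (2 * math.pi)
    mean_hour = mean_angle * 24 / (2 * math.pi)
    resultant = math.hypot(c, s)
    return mean_hour, resultant
```

Circular statistics matter here because an arithmetic mean of 23:00 and 01:00 would give noon, whereas the circular mean correctly lands at midnight.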

Metadata Enhancer (Data Processing)

Enrich item metadata with user-specific statistical context

Model or implementation: Prompt-guided fusion function

Personality Inference Module

Map behavioral contexts to Big Five personality scores

Model or implementation: Psychometric mapping + LLM Prompting

User Behaviour Simulator

Generate synthetic item selections based on inferred personality

Model or implementation: LLM Agent

Novel Architectural Elements

Explicit Psychometric Mapping Layer: Hard-coded mapping of statistical behavioral features (entropy, circular rhythm) to psychological traits before LLM inference, rather than relying on the LLM to implicitly 'understand' personality from raw logs
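
A minimal sketch of what such an explicit mapping layer could look like follows. It is hypothetical throughout: the formulas, thresholds, and function names are assumptions, and only two of the five traits are covered, following the summary's own example that high Openness goes with niche, diverse categories and high Conscientiousness with a regular purchase rhythm. The paper's actual psychometric functions are not given in this summary.

```python
import math

def clamp01(x):
    """Clip a score into the [0, 1] range."""
    return max(0.0, min(1.0, x))

def map_features_to_traits(entropy_bits, n_categories, rhythm_resultant):
    """Hypothetical hard-coded mapping from behavioural statistics to trait scores.

    Assumption: Openness is proxied by normalised category entropy and
    Conscientiousness by the regularity (resultant length) of purchase timing.
    """
    max_entropy = math.log2(n_categories) if n_categories > 1 else 1.0
    return {
        "openness": clamp01(entropy_bits / max_entropy),
        "conscientiousness": clamp01(rhythm_resultant),
    }

def level(score):
    """Bucket a [0, 1] score into a coarse label (thresholds are illustrative)."""
    if score < 1 / 3:
        return "low"
    if score < 2 / 3:
        return "medium"
    return "high"

def traits_to_text(traits):
    """Render trait scores as a short text fragment that could condition an LLM."""
    return "; ".join(
        f"{name}: {score:.2f} ({level(score)})" for name, score in traits.items()
    )
```

The point of the design is that trait scores are computed deterministically first, so the LLM receives explicit trait values instead of having to infer personality from raw logs.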

Modeling

Base Model: LLM-based. The specific backbone is not named in the extracted text; a GPT- or Llama-class model is presumed.

Key Hyperparameters:

negative_samples_k: 9
min_interactions_filter: 20
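
The sketch below shows one plausible way these two hyperparameters could be applied: users with fewer than 20 interactions are dropped, and each positive item is paired with 9 sampled negatives to form a 10-item candidate list. This reading is an assumption, since the summary does not say where in the pipeline the negatives are used, and the function names are hypothetical.

```python
import random

MIN_INTERACTIONS = 20   # min_interactions_filter from the summary
NEGATIVE_SAMPLES_K = 9  # negative_samples_k from the summary

def filter_users(histories, min_interactions=MIN_INTERACTIONS):
    """Keep only users whose interaction history is long enough."""
    return {
        user: items
        for user, items in histories.items()
        if len(items) >= min_interactions
    }

def build_candidates(positive, seen, catalog, k=NEGATIVE_SAMPLES_K, rng=None):
    """Return the positive item mixed with k unseen negatives, in random order."""
    rng = rng or random.Random()
    # Negatives are items the user has not interacted with.
    pool = [item for item in catalog if item not in seen and item != positive]
    if len(pool) < k:
        raise ValueError("not enough unseen items to sample negatives")
    candidates = rng.sample(pool, k) + [positive]
    rng.shuffle(candidates)
    return candidates
```

Passing a seeded `random.Random` instance as `rng` makes the sampled candidate lists reproducible across runs.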

Compute: Not reported in the paper

Comparison to Prior Work

vs. RecSim: PUB uses LLMs and personality traits rather than simple Markov transitions
vs. RecAgent: PUB infers specific personality traits from user logs rather than assigning random profiles (age/gender)
vs. Agent4Rec [not cited in paper]: PUB focuses on extracting psychometric Big Five traits rather than relying on prompt-based role-playing alone

Limitations

Collaborative filtering models (MF, LightGCN) perform worse on synthetic data, suggesting a lack of modeled social/collaborative signals
Dependency on the quality of the underlying LLM's reasoning capabilities
Analysis relies on Amazon Review datasets; generalizability to other domains (e.g., short video) is claimed but not empirically detailed in the extracted text

Reproducibility

Code: https://github.com/ChenglongMa/PUB

📊 Experiments & Results

Evaluation Setup

Recommender System Evaluation using synthetic vs. real data

Benchmarks:

Amazon Review Datasets (Sequential Recommendation / Rating Prediction)

Metrics:

Jaccard Similarity
nDCG@20

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Amazon Review Datasets	Jaccard Similarity	0.15	0.31	+0.16

Experiment Figures

Comparison of synthetic vs. real user behavior sequences using Jaccard similarity.

Performance comparison of RS algorithms on Real vs. Synthetic data.

Main Takeaways

PUB-generated logs reach 0.31 average Jaccard similarity with real user behavior, compared with 0.15 for the baseline, which supports the personality-driven approach.
Sequential recommenders (SASRec, GRU4Rec) and popularity-based models perform well on synthetic data, indicating PUB captures temporal and popularity preferences.
Collaborative filtering models (MF, LightGCN) show a performance gap (performing worse on synthetic data), suggesting the simulator does not fully capture the latent collaborative signals present in real user-item matrices.
Users with richer interaction histories allow for more accurate personality modeling, leading to higher Jaccard similarity in simulation.

📚 Prerequisite Knowledge

Prerequisites

Recommender Systems (Collaborative Filtering, Sequential Recommendation)
Big Five Personality Traits (Psychology)
Large Language Models (LLMs)

Key Terms

Big Five: The five-factor model of personality: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism

Jaccard similarity: The size of the intersection of two sets divided by the size of their union, ranging from 0 (no overlap) to 1 (identical sets); here used to measure overlap between real and synthetic item sequences
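
For concreteness, here is a small sketch of the Jaccard computation and its per-user average. It treats each sequence as a set, so order and repeats are ignored. Whether the paper compares item IDs, categories, or something else is not stated in this summary, and returning 1.0 for two empty sequences is a convention assumed here.

```python
def jaccard(real_items, synthetic_items):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two item sequences."""
    a, b = set(real_items), set(synthetic_items)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)

def mean_jaccard(pairs):
    """Average Jaccard similarity over (real, synthetic) sequence pairs."""
    scores = [jaccard(real, synthetic) for real, synthetic in pairs]
    return sum(scores) / len(scores)
```

For example, `jaccard([1, 2, 3], [2, 3, 4])` shares two items out of four distinct ones, giving 0.5.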

nDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that takes into account the position of relevant items in the recommendation list
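
A short sketch of nDCG@k with binary relevance follows, using the common 1/log2(rank + 1) discount for 1-indexed ranks. The function name is illustrative, the ranked list is assumed to contain no duplicates, and the paper's exact evaluation protocol may differ.

```python
import math

def ndcg_at_k(ranked_items, relevant_items, k=20):
    """nDCG@k with binary relevance for a single ranked list."""
    relevant = set(relevant_items)
    # enumerate() is 0-indexed, so the discount for position i is 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    # The ideal ranking places every relevant item at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

With a single relevant item, the score is 1.0 when that item is ranked first and drops to 1/log2(3), about 0.63, when it is ranked second.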

Collaborative Filtering: A recommendation strategy that predicts user preferences by assuming users who agreed in the past will agree in the future (e.g., Matrix Factorization)

SASRec: Self-Attentive Sequential Recommendation—a deep learning model that uses attention mechanisms to capture sequential patterns in user actions

Cold-start: The problem of recommending items to new users or recommending new items where little to no historical interaction data exists

LIWC-22: Linguistic Inquiry and Word Count—a text analysis program that counts words in psychologically meaningful categories