Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

📝 Paper Summary

Dynamic User Profiling, Long-context Memory, Personalized Dialogue Systems

PersonaMem is a benchmark for evaluating whether LLMs can track evolving user personas over long interaction histories, revealing that frontier models achieve only ~50% accuracy in personalized response selection.

Core Problem

LLMs struggle to track evolving user traits and preferences (personas) over long-term interaction histories, failing to update their internal user profile when new events contradict or refine past information.

Why it matters:

Users perceive chatbots as less empathetic and helpful when they fail to remember or adapt to life changes (e.g., new allergies, changed marital status)
Static user profiles are insufficient because real-world user preferences change over time
Existing benchmarks often focus on static fact retrieval rather than the dynamic evolution of user characteristics across different scenarios

Concrete Example: A user initially tells the chatbot 'I like pizza.' In a later session, they mention, 'I've started exploring gluten-free options' due to an allergy. When subsequently asked for food recommendations, a non-personalized LLM suggests standard pizza, failing to incorporate the recent health update.

Key Novelty

PersonaMem Benchmark

Constructs synthetic, long-context interaction histories where user personas evolve chronologically based on simulated life events (e.g., job changes, health issues)
Evaluates personalization via 'in-situ' user queries across 7 types (e.g., suggesting new ideas, revisiting reasons for change) that require understanding the user's *current* state
Uses a modular generation pipeline to create coherent, multi-session histories (up to 1M tokens) that maintain causal consistency across diverse topics like therapy and travel
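The chronological persona evolution described above can be pictured as a timeline of timestamped updates in which the most recent statement about an attribute wins. The following is a minimal sketch under that assumption; the class names, field names, and example values are hypothetical and are not taken from the PersonaMem code.

```python
from dataclasses import dataclass, field


@dataclass
class PersonaEvent:
    """One timestamped update to a user attribute (hypothetical schema)."""
    session: int    # index of the session in which the update was stated
    attribute: str  # e.g. "food_preference"
    value: str      # what the user said in that session


@dataclass
class PersonaTimeline:
    events: list = field(default_factory=list)

    def add(self, session: int, attribute: str, value: str) -> None:
        self.events.append(PersonaEvent(session, attribute, value))

    def current_state(self) -> dict:
        """Latest value per attribute: later sessions override earlier ones."""
        state = {}
        for event in sorted(self.events, key=lambda e: e.session):
            state[event.attribute] = event.value
        return state

    def history_of(self, attribute: str) -> list:
        """All values an attribute has taken, oldest first."""
        ordered = sorted(self.events, key=lambda e: e.session)
        return [e.value for e in ordered if e.attribute == attribute]


timeline = PersonaTimeline()
timeline.add(1, "food_preference", "likes pizza")
timeline.add(5, "food_preference", "exploring gluten-free options")
timeline.add(3, "occupation", "teacher")

print(timeline.current_state())
print(timeline.history_of("food_preference"))
```

Keeping the full history alongside the current state matters here: some query types ask about the user's present situation, while others ask why a preference changed, which needs the earlier values too.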

Architecture

Overview of the PersonaMem benchmark construction and evaluation concept.

Evaluation Highlights

Frontier models (GPT-4.5, Gemini-1.5, o1) achieve only ~50% accuracy on four-option multiple-choice personalization tasks: well above the 25% random-chance baseline, but still wrong about half the time because the distractors are hard to rule out
Llama-4-Maverick scores lower at 43% overall accuracy, indicating significant room for improvement in open-weights models
Models perform well on recalling static facts (60–70% accuracy) but fail significantly when asked to incorporate the user's latest situation into new suggestions (30–50% accuracy)

Breakthrough Assessment

8/10

Provides a crucial, realistic benchmark for a major gap in current LLM capabilities (dynamic personalization). The rigorous timeline-based construction and poor performance of SOTA models highlight a significant unsolved problem.

⚙️ Technical Details

Problem Definition

Setting: Multiple-choice response selection based on long-context interaction history

Inputs: User interaction history C (up to 1M tokens) and an in-situ user query q

Outputs: Selection of the most appropriate response r from a set of four options {r1, r2, r3, r4}
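To make the task setup concrete, here is a minimal sketch of one benchmark item and the accuracy computation over a set of items. The `PersonalizationItem` type, its `answer_index` field, and the `always_first` selector are illustrative assumptions, not the benchmark's actual data schema.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class PersonalizationItem:
    history: str              # interaction history C
    query: str                # in-situ user query q
    options: Tuple[str, ...]  # candidate responses (r1, r2, r3, r4)
    answer_index: int         # position of the correct response (assumed field)


# A selector maps (history, query, options) to the index of the chosen option.
Selector = Callable[[str, str, Sequence[str]], int]


def accuracy(items: Sequence[PersonalizationItem], select: Selector) -> float:
    """Percentage of items on which the selector picks the correct option."""
    if not items:
        raise ValueError("need at least one item")
    correct = sum(
        1 for item in items
        if select(item.history, item.query, item.options) == item.answer_index
    )
    return 100.0 * correct / len(items)


def always_first(history: str, query: str, options: Sequence[str]) -> int:
    """Trivial selector that ignores its inputs and picks the first option."""
    return 0


# Four toy items whose correct answers sit at positions 0, 1, 2, 3.
toy_items = [
    PersonalizationItem("(history)", "(query)", ("r1", "r2", "r3", "r4"), i)
    for i in range(4)
]
print(accuracy(toy_items, always_first))  # 25.0
```

With correct answers spread evenly across the four positions, a selector that ignores the history scores 25%, which is the chance level the reported accuracies are compared against.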

Pipeline Flow

Input: User History + Current Query
Retriever (Optional Baseline)
LLM Inference (Response Selection)
Output: Selected Response Choice
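The four pipeline stages can be sketched as a single function with an optional retrieval step. This is only an illustration: the prompt wording, the `(a)` to `(d)` answer format, and the stub model are assumptions, and `llm` stands in for any callable that maps a prompt string to a reply string.

```python
import re
from typing import Callable, Optional, Sequence

LETTERS = "abcd"


def build_prompt(context: str, query: str, options: Sequence[str]) -> str:
    """Format history, query, and lettered options (assumed prompt layout)."""
    lines = ["Conversation history:", context, "", f"User: {query}", "",
             "Candidate responses:"]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the best response.")
    return "\n".join(lines)


def parse_choice(reply: str) -> Optional[int]:
    """Return the index of the first '(a)'..'(d)' in the reply, else None."""
    match = re.search(r"\(([abcd])\)", reply.lower())
    return LETTERS.index(match.group(1)) if match else None


def run_pipeline(
    history: Sequence[str],
    query: str,
    options: Sequence[str],
    llm: Callable[[str], str],
    retriever: Optional[Callable[[Sequence[str], str], Sequence[str]]] = None,
) -> Optional[int]:
    # Optional retrieval stage; without it the full history goes to the model.
    turns = retriever(history, query) if retriever else list(history)
    prompt = build_prompt("\n".join(turns), query, options)
    return parse_choice(llm(prompt))


prompts = []


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; records the prompt, always answers (b)."""
    prompts.append(prompt)
    return "I would pick (b)."


history = ["User: I like pizza.",
           "User: I've started exploring gluten-free options."]
options = ["How about a classic margherita pizza?",
           "A gluten-free grain bowl could work well.",
           "Maybe try a new sushi place.",
           "You could bake fresh sourdough bread."]
choice = run_pipeline(history, "Any dinner ideas?", options, stub_llm,
                      retriever=lambda turns, q: turns[-1:])
print(choice)  # 1
```

Returning `None` for an unparseable reply keeps formatting failures separate from wrong answers, which is one possible design choice when scoring.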

System Modules

Retriever

Fetch relevant past conversation turns to support the answer (used in RAG/Mem0 experiments)

Model or implementation: BGE-M3 (for RAG baseline)
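The retrieval step can be illustrated with a top-k similarity search over past turns. The sketch below deliberately swaps BGE-M3 for a toy bag-of-words embedding so that it runs without any model download; the function names, the value of `k`, and the example turns are assumptions.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a dense encoder."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(turns, query, k=2):
    """Return the k turns most similar to the query, in chronological order."""
    q = embed(query)
    ranked = sorted(range(len(turns)),
                    key=lambda i: cosine(embed(turns[i]), q),
                    reverse=True)
    return [turns[i] for i in sorted(ranked[:k])]


turns = [
    "I like pizza and pasta for dinner.",
    "My new job starts next month.",
    "I've started exploring gluten-free food options because of an allergy.",
    "We booked a trip to Lisbon.",
]
for turn in retrieve(turns, "What food should I order for dinner tonight?"):
    print(turn)
```

In this toy case both the original pizza preference and the later gluten-free update are retrieved. Fetching relevant turns is therefore only part of the problem: the model still has to work out which statement reflects the user's current situation.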

LLM Inference

Select the correct personalized response from 4 options based on history

Model or implementation: Various (GPT-4o, Gemini-1.5, Llama-4, etc.)

Modeling

Base Model: Multiple models evaluated, including GPT-4.5, Gemini-1.5-Flash, Llama-4-Maverick, GPT-4o, and o1

Training Method: Zero-shot evaluation

Adaptation: None (Pre-trained models evaluated directly)

Trainable Parameters: 0

Compute: Not reported in the paper

Comparison to Prior Work

vs. RAG/Mem0: PersonaMem evaluates these systems against plain long-context processing and finds that they improve factual recall but still struggle with complex reasoning about evolving preferences.

Limitations

Evaluation relies on multiple-choice selection, which may not perfectly reflect open-ended generation performance.
The benchmark is synthetic (simulated by GPT-4o), though validated by humans.
Generative evaluation (ranking options by summed log-probability) was limited to 10-session histories due to compute costs.

Reproducibility

Code: https://github.com/bowen-upenn/PersonaMem

Benchmark data and code are publicly available at github.com/bowen-upenn/PersonaMem. The data generation pipeline uses GPT-4o. Proprietary models (e.g., GPT-4o, GPT-4.5, o1, Gemini-1.5) were evaluated via API.

📊 Experiments & Results

Evaluation Setup

Multiple-choice QA over a long conversation history (discriminative setting) and log-probability ranking of the candidate responses (generative setting)
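For the generative setting, a minimal sketch of log-probability ranking is shown below. It assumes that per-token log-probabilities for each candidate response, conditioned on the history and query, have already been obtained from a model; the numbers are made up for illustration and the function names are hypothetical.

```python
import math


def sequence_logprob(token_logprobs):
    """Log-probability of a whole sequence: the sum of its token log-probs."""
    return math.fsum(token_logprobs)


def rank_options(option_token_logprobs):
    """Return (index of the highest-scoring option, list of all scores)."""
    scores = [sequence_logprob(lps) for lps in option_token_logprobs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical per-token log-probabilities for four candidate responses.
candidates = [
    [-0.5, -1.0, -0.5],
    [-0.25, -0.25, -0.5],
    [-1.5, -2.0],
    [-0.75, -0.75, -1.0, -0.5],
]
best, scores = rank_options(candidates)
print(best, scores)  # 1 [-2.0, -1.0, -3.5, -3.0]
```

A plain sum tends to favor shorter responses, since every extra token adds a negative term. Dividing by sequence length is a common variation, but whether the benchmark applies any such normalization is not stated here.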

Benchmarks:

PersonaMem (Dynamic User Profiling / Personalized Response Selection) [New]

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper
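Because the results are reported both overall and per query type, a small sketch of the grouped accuracy computation may help. The query-type labels and the outcome records below are toy values chosen for illustration, not benchmark data.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """records: iterable of (query_type, is_correct) pairs.

    Returns a dict mapping each query type to its accuracy in percent.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for query_type, is_correct in records:
        totals[query_type] += 1
        hits[query_type] += int(is_correct)
    return {qt: 100.0 * hits[qt] / totals[qt] for qt in totals}


def overall_accuracy(records):
    records = list(records)
    return 100.0 * sum(int(ok) for _, ok in records) / len(records)


# Toy outcomes: labels are hypothetical stand-ins for two query types.
toy_records = [
    ("recall_facts", True), ("recall_facts", True),
    ("recall_facts", True), ("recall_facts", False),
    ("suggest_new_ideas", True), ("suggest_new_ideas", False),
    ("suggest_new_ideas", False), ("suggest_new_ideas", False),
]
print(accuracy_by_type(toy_records))
print(overall_accuracy(toy_records))  # 50.0
```

The toy numbers show how a single overall figure can hide a large gap between query types, which is why the per-type breakdown is reported separately.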

Key Results

Overall accuracy results for long-context models on the PersonaMem benchmark (128k context setting).
Benchmark	Metric	Baseline	This Paper	Δ
PersonaMem	Accuracy (%)	25.0	52.0	+27.0
PersonaMem	Accuracy (%)	25.0	43.0	+18.0

Performance breakdown by query type shows models struggle with applying new suggestions compared to recalling facts.
Benchmark	Metric	Baseline	This Paper	Δ
PersonaMem	Accuracy (%)	65.0	40.0	-25.0

Human validation of the synthetic dataset confirms high quality.
Benchmark	Metric	Baseline	This Paper	Δ
PersonaMem Human Eval	Appropriateness (%)	0.0	97.8	+97.8
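The Δ column is simply the reported value minus the baseline. The short check below recomputes it from the table's own numbers; the row labels are descriptive placeholders rather than names from the paper.

```python
# (label, baseline, reported value, reported delta), copied from the table.
rows = [
    ("overall accuracy, first row", 25.0, 52.0, 27.0),
    ("overall accuracy, second row", 25.0, 43.0, 18.0),
    ("query-type breakdown", 65.0, 40.0, -25.0),
    ("human evaluation", 0.0, 97.8, 97.8),
]


def compute_deltas(table_rows):
    """Recompute value - baseline for every row, rounded to one decimal."""
    return [round(value - baseline, 1) for _, baseline, value, _ in table_rows]


computed = compute_deltas(rows)
for (label, _, _, reported), delta in zip(rows, computed):
    status = "ok" if delta == reported else "mismatch"
    print(f"{label}: {delta:+.1f} ({status})")
```

Note that the baseline means different things across the groups: 25.0 is random chance for a four-option question, while the 65.0 baseline in the query-type row is itself an accuracy figure being compared against.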

Experiment Figures

Accuracy breakdown by question type for various models.

Model performance based on the position of relevant information in the conversation history.

Main Takeaways

Current frontier models (GPT-4.5, Gemini-1.5) struggle to track dynamic user profiles, achieving only ~50% accuracy on personalization tasks.
Models are significantly better at simply recalling past facts (60–70% accuracy) than at applying that knowledge to suggest new ideas or generalize to new scenarios (30–50% accuracy).
Retrieval-augmented methods (RAG, Mem0) improve performance on factual recall tasks but are less effective for tasks requiring reasoning about preference evolution.
Reasoning models (o1, o3-mini) do not show a significant advantage over standard models in this personalization domain.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and context windows
Familiarity with Retrieval-Augmented Generation (RAG)
Basic concepts of user profiling and personalization

Key Terms

PersonaMem: The proposed benchmark dataset containing simulated user-LLM interaction histories with evolving user profiles

In-situ user query: A query issued by the user from the first-person perspective within a conversation session, requiring context from history to answer correctly

RAG: Retrieval-Augmented Generation—systems that retrieve relevant documents to ground LLM responses

Discriminative setting: Evaluation mode where the model selects the correct response from a provided list of options

Generative setting: Evaluation mode where the model's preference is determined by the log-probability of generating each option sequence

SOTA: State-of-the-Art—the current best performance levels achieved by leading models

Distractor: An incorrect multiple-choice option designed to be plausible but based on outdated or irrelevant user information