How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants

📝 Paper Summary

Personalized Assistants (PAs), Memory Utilization in LLMs

The paper introduces RPEval to benchmark 'irrational personalization' in LLMs and proposes RP-Reasoner, a pragmatic inference module that selectively integrates memory by estimating query likelihood and intent priors.

Core Problem

Current LLM-based assistants practice 'Literal Personalization' (L1), indiscriminately appending retrieved memories to context even when irrelevant or conflicting with the user's current intent.

Why it matters:

Irrelevant memory injection leads to 'Filter Bubbles' where general needs are ignored in favor of niche preferences (e.g., recommending rock music for a relaxing lunch break)
Existing benchmarks assume memory is always dominant, failing to test the assistant's ability to suppress irrelevant information
Real-world queries are often under-specified, requiring assistants to actively reason about whether to apply sparse, fragmented user profiles

Concrete Example: A user with a preference for 'strong rhythm music' asks for 'relaxing audio for school'. A standard assistant mistakenly incorporates the preference, recommending 'Game Soundtrack Dark Horse', which conflicts with the relaxing intent. RP-Reasoner infers the conflict and correctly ignores the memory.

Key Novelty

Rational Personalization (L2) via Pragmatic Reasoning

Reframes personalization as a Bayesian inference task where the assistant must reverse-engineer the user's latent intent from their surface-level query
Uses 'Counterfactual Elimination': if the user had truly wanted a specific preference applied, they would likely have phrased the query differently to trigger it
Separates memory usage into 'Query Likelihood' (does the query imply the preference?) and 'Intent Prior' (is the preference generally likely?) to score applicability
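One way to read this decomposition is as Bayes' rule over candidate intents. The notation below is an assumption for illustration (q, m, and i follow the Problem Definition; I is the candidate set); the paper's exact formulation may differ, and the Aggregator described under System Modules combines ranks rather than multiplying probabilities.

```latex
P(i \mid q, m) \;\propto\;
\underbrace{P(q \mid i, m)}_{\text{Query Likelihood}}
\cdot
\underbrace{P(i \mid m)}_{\text{Intent Prior}},
\qquad
i^{*} = \arg\max_{i \in I} P(i \mid q, m)
```

Under this reading, a preference-driven intent is eliminated when the observed query would be an unlikely way to express it, even if the memory alone makes that intent plausible.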

Architecture

The inference pipeline of RP-Reasoner, detailing how it selects the best intent from candidates.

Evaluation Highlights

RP-Reasoner achieves a ~35% improvement in Micro-Accuracy and reduces error severity by 26% on the RPEval benchmark compared to baselines
Resolves 80% of bad cases in a large-scale commercial personalized assistant deployment
Reveals an 'inverse scaling' effect: more capable models (such as GPT-5) are worse at ignoring irrelevant preferences, which the paper attributes to their stronger attention mechanisms

Breakthrough Assessment

8/10

Identifies a critical, overlooked failure mode in personalized agents (irrational over-personalization) and provides both a solid benchmark and an effective, theoretically grounded inference-time solution.

⚙️ Technical Details

Problem Definition

Setting: Given a user query q and retrieved memory m (set of preferences), predict intent i and generate response r

Inputs: Query q, Memory context m

Outputs: Intent prediction i, Response r

Pipeline Flow

Intent Candidate Generation (Generate candidate intents I)
Query Likelihood Estimation (MLE) + Intent Prior Estimation (IPE)
Aggregation & Ranking (Select best intent i*)
Response Generation (Generate r based on i*)
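The four steps above can be sketched as a small orchestration function. This is a minimal illustration, not the authors' implementation: the function names, the callable signatures, and the toy scoring tables are all hypothetical stand-ins for LLM calls, and ties are broken by candidate order as a simplifying assumption.

```python
from typing import Callable, Dict, List, Tuple


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each candidate to its rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=lambda c: scores[c], reverse=True)
    return {c: pos for pos, c in enumerate(ordered, start=1)}


def rp_reasoner_sketch(
    query: str,
    memory: List[str],
    generate_candidates: Callable[[str, List[str]], List[str]],
    query_likelihood: Callable[[str, str], float],
    intent_prior: Callable[[List[str], str], float],
    respond: Callable[[str, str], str],
) -> Tuple[str, str]:
    # Step 1: generate candidate intents I.
    candidates = generate_candidates(query, memory)
    # Step 2: score every candidate with both estimators.
    ql_rank = rank_positions({c: query_likelihood(query, c) for c in candidates})
    ip_rank = rank_positions({c: intent_prior(memory, c) for c in candidates})
    # Step 3: aggregate by rank sum and keep the lowest total as i*.
    best = min(candidates, key=lambda c: ql_rank[c] + ip_rank[c])
    # Step 4: generate the response r conditioned on i*.
    return best, respond(query, best)


# Toy stand-ins for the LLM-backed modules (hypothetical values).
def toy_candidates(query: str, memory: List[str]) -> List[str]:
    return ["ignore", "support", "dominate"]


def toy_query_likelihood(query: str, intent: str) -> float:
    return {"ignore": 0.9, "dominate": 0.3, "support": 0.2}[intent]


def toy_intent_prior(memory: List[str], intent: str) -> float:
    return {"support": 0.5, "ignore": 0.4, "dominate": 0.1}[intent]


def toy_respond(query: str, intent: str) -> str:
    return f"[{intent}] response to: {query}"


best, reply = rp_reasoner_sketch(
    "relaxing audio for school",
    ["prefers strong rhythm music"],
    toy_candidates,
    toy_query_likelihood,
    toy_intent_prior,
    toy_respond,
)
print(best, "|", reply)
```

Passing the modules in as callables keeps the control flow separate from whichever LLM backbone implements each estimator.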

System Modules

Candidate Generator (Intent Reasoning)

Generate a set of candidate intents under various preference utilization modes (Ignore, Support, Dominate)

Model or implementation: LLM backbone (e.g., GPT-4.1, Qwen2.5)

Query Likelihood Estimator (MLE) (Intent Reasoning)

Estimate the semantic distance between the observed query and a simulated query generated from a candidate intent (Counterfactual Elimination)

Model or implementation: LLM backbone
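To make the counterfactual comparison concrete, the sketch below scores each candidate intent by how closely its simulated query matches the observed one. Token-level Jaccard overlap is swapped in here as a crude stand-in for the LLM-based semantic comparison, and the simulated queries are hand-written assumptions rather than model output.

```python
from typing import Dict


def jaccard_similarity(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a crude proxy for semantic closeness."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def query_likelihood_scores(observed: str, simulated: Dict[str, str]) -> Dict[str, float]:
    """Score each intent by similarity between its simulated query and the observed query."""
    return {intent: jaccard_similarity(observed, q) for intent, q in simulated.items()}


observed_query = "relaxing audio for school"
# Hypothetical queries a user might have written under each intent.
simulated_queries = {
    "ignore": "calm relaxing audio for school",
    "dominate": "strong rhythm music for school",
}
likelihood = query_likelihood_scores(observed_query, simulated_queries)
print(likelihood)
```

The 'dominate' intent scores lower because a user who wanted rhythm-heavy music would probably have said so, which is the elimination step in miniature.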

Intent Prior Estimator (IPE) (Intent Reasoning)

Estimate the plausibility of an intent based solely on historical memory, independent of the current query

Model or implementation: LLM backbone

Aggregator (Intent Reasoning)

Combine ranks from MLE and IPE to select the final intent

Model or implementation: Algorithm (Rank Sum)
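A rank-sum aggregation can be sketched as follows. The summary does not say how ties are handled, so sharing the average rank among tied scores is an assumption, as are the function names and the example scores.

```python
from typing import Dict, List, Tuple


def to_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank 1 is the highest score; tied scores share their average rank."""
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j across the run of candidates tied with position i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = average_rank
        i = j + 1
    return ranks


def rank_sum(likelihood: Dict[str, float], prior: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return candidates sorted by summed rank, best (lowest total) first."""
    likelihood_ranks, prior_ranks = to_ranks(likelihood), to_ranks(prior)
    totals = {c: likelihood_ranks[c] + prior_ranks[c] for c in likelihood}
    return sorted(totals.items(), key=lambda kv: kv[1])


ranking = rank_sum(
    {"ignore": 0.8, "support": 0.5, "dominate": 0.1},  # hypothetical MLE scores
    {"ignore": 0.3, "support": 0.6, "dominate": 0.6},  # hypothetical IPE scores
)
print(ranking)
```

Working on ranks rather than raw scores means the two estimators do not need to be calibrated on a common scale.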

Novel Architectural Elements

Pragmatic inference loop inserted before response generation
Dual-ranking mechanism using counterfactual query generation (MLE) and memory priors (IPE)

Modeling

Base Model: Evaluated on Qwen2.5-7B, DeepSeek-V3, GPT-4.1, and GPT-5 (OpenAI, 2025)

Comparison to Prior Work

vs. MemBench/PrefEval: RPEval focuses on the decision of *whether* to use memory (Rationality/L2) rather than just retrieval accuracy (L1)
vs. Standard RAG: RP-Reasoner adds a pragmatic reasoning layer to filter retrieved context based on query intent, preventing over-personalization

Limitations

Focuses on memory utilization logic, not long-context retrieval capabilities (assumes memory is already retrieved)
Relies on LLM-as-a-Judge for severity scoring, though human agreement is high (quadratic-weighted Cohen's kappa, QWK = 0.87)
Inference cost is higher due to candidate generation and multiple estimation steps
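For readers unfamiliar with the QWK agreement statistic cited in the limitations above, the following sketch computes quadratic-weighted Cohen's kappa from two lists of ordinal ratings. It uses the standard textbook definition; encoding severity as integer levels 0..k-1 is an assumption, since the summary does not specify the rating scale.

```python
from typing import List


def quadratic_weighted_kappa(rater_a: List[int], rater_b: List[int], num_levels: int) -> float:
    """Quadratic-weighted Cohen's kappa for ratings in 0..num_levels-1.

    Undefined (division by zero) when both raters give one identical constant rating.
    """
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for x, y in zip(rater_a, rater_b):
        observed[x][y] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2  # quadratic disagreement penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], num_levels=4)
reversed_scores = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], num_levels=3)
print(perfect, reversed_scores)
```

A value of 1 means perfect agreement and 0 means chance-level agreement; the quadratic weights penalize large disagreements on the ordinal scale more heavily than near misses.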

Reproducibility

Code: https://github.com/XueyangFeng/RPEval

📊 Experiments & Results

Evaluation Setup

Personalized Intent Reasoning: discriminating intent (Ignore/Support/Dominate) and generating responses

Benchmarks:

RPEval (Personalized Response Generation) [New]

Metrics:

Discriminative Accuracy (Macro/Micro)
Generative Error Severity (Filter Bubble, Redundant Info, etc.)
Judge Score (Overall user experience)
Statistical methodology: Quadratic-weighted Cohen's kappa for judge agreement
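The Macro/Micro distinction in the accuracy metric can be illustrated with the usual definitions: micro-accuracy is the fraction of all examples predicted correctly, while macro-accuracy averages per-class accuracy so that rare classes such as 'Ignore' weigh as much as common ones. That RPEval uses exactly these definitions is an assumption, and the labels below are made up.

```python
from collections import defaultdict
from typing import List, Tuple


def micro_macro_accuracy(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (micro, macro) accuracy for paired gold and predicted labels."""
    # Micro: pool all examples together.
    micro = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    # Macro: accuracy within each gold class, then an unweighted mean.
    per_class = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for g, p in zip(gold, pred):
        per_class[g][0] += int(g == p)
        per_class[g][1] += 1
    macro = sum(correct / total for correct, total in per_class.values()) / len(per_class)
    return micro, macro


# Hypothetical predictions from a model that rarely chooses 'ignore'.
gold_labels = ["ignore", "ignore", "support", "support", "support", "support", "dominate", "dominate"]
pred_labels = ["support", "support", "support", "support", "support", "support", "dominate", "support"]
micro, macro = micro_macro_accuracy(gold_labels, pred_labels)
print(micro, macro)
```

Here the micro score (0.625) looks better than the macro score (0.5) because the model is never right on the 'ignore' class, which is the failure mode the benchmark targets.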

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Discriminative analysis shows a massive gap between human performance (Baseline column) and LLM performance (This Paper column) in identifying when to apply preferences, with larger models surprisingly performing worse at ignoring irrelevant memory.
RPEval (Discriminative, Single-Explicit)	Ignore Accuracy	0.86	0.12	-0.74
RPEval (Discriminative, Multi-Micro)	Accuracy	0.93	0.39	-0.54
Generative performance of RP-Reasoner vs Baselines on RPEval.
RPEval (Generative, Multi-Preference)	Macro-Accuracy	0.065	0.233	+0.168
RPEval (Generative, Multi-Preference)	Judge Score (Lower is better)	3.182	2.339	-0.843
Validation on real-world bad cases from a commercial personalized assistant.
Commercial PA Bad Cases	Judge Score (Lower is better)	3.533	2.493	-1.040
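As a quick arithmetic check on the Judge Score rows above, the relative reduction can be computed from the Baseline and This Paper columns. Reading the 26% error-severity figure in Evaluation Highlights as this relative reduction is an assumption, not something the summary states.

```latex
\begin{aligned}
\text{RPEval (Generative, Multi-Preference):}\quad
  & \frac{3.182 - 2.339}{3.182} = \frac{0.843}{3.182} \approx 0.265 \\
\text{Commercial PA Bad Cases:}\quad
  & \frac{3.533 - 2.493}{3.533} = \frac{1.040}{3.533} \approx 0.294
\end{aligned}
```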

Experiment Figures

Fine-grained error analysis and attraction bias mechanism.

Performance comparison of RP-Reasoner vs baselines (Vanilla, Reminder, CoT) on Multi-Preference Generative settings.

Main Takeaways

Inverse Scaling: More capable models (GPT-4/5) have stronger attention mechanisms, which make them harder to distract but, paradoxically, more prone to 'attraction bias': they over-utilize irrelevant memory compared to weaker models.
LLMs default to a 'more-is-better' strategy, struggling to suppress memory when the ground truth intent is 'Ignore'.
Pragmatic reasoning (RP-Reasoner) effectively balances the 'conservative' nature of Query Likelihood (MLE) with the 'permissive' nature of Intent Prior (IPE) to achieve rational personalization.

📚 Prerequisite Knowledge

Prerequisites

Bayesian Inference
Rational Speech Acts (RSA) theory
Retrieval-Augmented Generation (RAG)

Key Terms

Rational Personalization (L2): A personalization strategy where the model infers whether to apply memory based on pragmatic cues in the query, rather than blindly using it

Literal Personalization (L1): A strategy where retrieved memory is directly concatenated to the context and assumed relevant, often leading to errors

Filter Bubble (FB): An error where the assistant restricts responses to preference-specific content when general suggestions would be appropriate

Redundant Information (RII): An error where the assistant provides both preference-specific and general suggestions unnecessarily

Under-Personalization (UPB): An error where the assistant ignores relevant preferences

Inverse Scaling: A phenomenon where model performance on a specific task degrades as the model's general capabilities (size/training) increase

RP-Reasoner: The proposed method comprising Query Likelihood Estimation and Intent Prior Estimation to rank intent candidates

Low Feasibility (LF): Response contains impractical or ill-posed suggestions due to forced personalization