← Back to Paper List

OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents

Y Hu, Z Long, J Guo, X Sui, X Fu, W Zhao, Y Zhao…
Harbin Institute of Technology
arXiv, 1/2026 (2026)
Memory P13N Benchmark Factuality

📝 Paper Summary

Memory-augmented conversational agents Personalized dialogue systems
The paper identifies "over-personalization"—where agents intrusively or incorrectly apply user memories—as a major failure mode, introduces a benchmark to measure it, and proposes a relevance-based filtering module to mitigate it.
Core Problem
Memory-augmented agents often overuse personal information, producing forced, intrusive, or sycophantic responses even when the context does not require personalization.
Why it matters:
  • Current benchmarks focus on recall (remembering facts) but overlook whether applying that memory is socially appropriate or relevant.
  • Over-personalization degrades user experience by reducing control, factual accuracy, and response diversity.
  • Existing agents exhibit "memory hijacking," where retrieved memories disproportionately influence generation regardless of the query's actual need for personalization.
Concrete Example: If a user asks a general question about a topic (e.g., "What is the capital of France?"), an over-personalized agent might force an irrelevant reference to the user's past vacation or preference (e.g., "Paris, which you visited last summer and loved!"), making the interaction feel intrusive.
Key Novelty
Formalizing Over-Personalization and Filtering via Self-ReCheck
  • Defines three specific types of over-personalization: Irrelevance (off-topic insertion), Sycophancy (agreeing with user errors/biases), and Repetition (reusing the same memory content).
  • Constructs OP-Bench using a pipeline that generates tricky "baiting" questions and false memories to test if agents can resist using them.
  • Proposes Self-ReCheck: a lightweight filter that double-checks if retrieved memories are actually relevant to the current query before the generator sees them.
Architecture
Architecture Figure Figure 3
The construction pipeline of OP-Bench, detailing the three stages: Data Preprocessing, Task Construction (Irrelevance, Sycophancy, Repetition), and Human Review.
Evaluation Highlights
  • Current personalized agents suffer massive performance drops (relative drops of 26.2% to 61.1%) on OP-Bench compared to non-memory baselines, indicating severe over-personalization.
  • Self-ReCheck reduces over-personalization by 29% on average across various models and memory systems while preserving personalization abilities.
  • Analysis reveals "memory hijacking," where irrelevant retrieved memories receive disproportionately high attention during generation, biasing the output.
Breakthrough Assessment
8/10
Identifies a critical, overlooked failure mode in the popular field of memory agents. The benchmark is theoretically grounded, and the proposed solution is simple yet effective. High practical value for safe agent deployment.
×