OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents

📝 Paper Summary

Memory-augmented conversational agents; Personalized dialogue systems

The paper identifies "over-personalization" (agents applying user memories intrusively or incorrectly) as a major failure mode, introduces a benchmark to measure it, and proposes a relevance-based filtering module to mitigate it.

Core Problem

Memory-augmented agents often overuse personal information, producing forced, intrusive, or sycophantic responses even when the context does not require personalization.

Why it matters:

Current benchmarks focus on recall (remembering facts) but overlook whether applying that memory is socially appropriate or relevant.
Over-personalization degrades user experience by reducing control, factual accuracy, and response diversity.
Existing agents exhibit "memory hijacking," where retrieved memories disproportionately influence generation regardless of the query's actual need for personalization.

Concrete Example: If a user asks a general question about a topic (e.g., "What is the capital of France?"), an over-personalized agent might force an irrelevant reference to the user's past vacation or preference (e.g., "Paris, which you visited last summer and loved!"), making the interaction feel intrusive.

Key Novelty

Formalizing Over-Personalization and Filtering via Self-ReCheck

Defines three specific types of over-personalization: Irrelevance (off-topic insertion), Sycophancy (agreeing with user errors/biases), and Repetition (reusing the same memory content).
Constructs OP-Bench using a pipeline that generates "baiting" questions and false memories to test whether agents can resist using them.
Proposes Self-ReCheck: a lightweight filter that double-checks if retrieved memories are actually relevant to the current query before the generator sees them.

Architecture

The construction pipeline of OP-Bench, detailing the three stages: Data Preprocessing, Task Construction (Irrelevance, Sycophancy, Repetition), and Human Review.

Evaluation Highlights

Current personalized agents suffer relative performance drops of 26.2% to 61.1% on OP-Bench compared to their no-memory baselines, indicating severe over-personalization.
Self-ReCheck reduces over-personalization by 29% on average across various models and memory systems while preserving personalization abilities.
Analysis reveals "memory hijacking," where irrelevant retrieved memories receive disproportionately high attention during generation, biasing the output.
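The relative drops quoted above can be read as the fraction of the no-memory score that is lost once memory is attached. A minimal sketch of that arithmetic, assuming higher scores are better; the function name and the example scores are illustrative and are not values from the paper:

```python
def relative_drop(baseline_score: float, memory_score: float) -> float:
    """Fraction of the baseline score lost after adding memory (0.25 means 25%)."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (baseline_score - memory_score) / baseline_score


# Hypothetical scores, chosen only to illustrate the calculation.
drop = relative_drop(baseline_score=0.80, memory_score=0.60)
print(f"{drop:.1%}")  # prints 25.0%
```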

Breakthrough Assessment

8/10

Identifies a critical, overlooked failure mode in the popular field of memory agents. The benchmark is theoretically grounded, and the proposed solution is simple yet effective. High practical value for safe agent deployment.

⚙️ Technical Details

Problem Definition

Setting: Evaluating memory-augmented dialogue systems on their ability to avoid inappropriate use of user memory.

Inputs: User query and a long-term memory store (containing user profiles/preferences).

Outputs: A textual response that should only use memory if contextually relevant.

Pipeline Flow

Memory Retrieval (fetches top-k memories)
Self-ReCheck (filters irrelevant memories)
Generation (produces response)
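
The three stages above can be sketched as a single function that takes each stage as a pluggable callable. This is an illustrative outline only: `respond`, the type aliases, and the toy stand-ins below are hypothetical names, and the crude first-person check merely stands in for the LLM-based judge described in this summary.

```python
from typing import Callable, List

Retriever = Callable[[str, List[str], int], List[str]]
RelevanceCheck = Callable[[str, str], bool]
Generator = Callable[[str, List[str]], str]


def respond(query: str, memory_store: List[str], retrieve: Retriever,
            is_relevant: RelevanceCheck, generate: Generator, k: int = 5) -> str:
    candidates = retrieve(query, memory_store, k)            # 1. Memory Retrieval
    kept = [m for m in candidates if is_relevant(query, m)]  # 2. Self-ReCheck
    return generate(query, kept)                             # 3. Generation


# Toy stand-ins so the sketch runs without a retriever or an LLM.
def overlap_retrieve(query: str, store: List[str], k: int) -> List[str]:
    words = set(query.lower().split())
    ranked = sorted(store, key=lambda m: len(words & set(m.lower().split())),
                    reverse=True)
    return ranked[:k]


def first_person_check(query: str, memory: str) -> bool:
    # Placeholder for an LLM judge: keep memories only for self-referential queries.
    return bool({"i", "my", "me"} & set(query.lower().split()))


def echo_generate(query: str, memories: List[str]) -> str:
    return f"[{len(memories)} memories] {query}"


store = ["User visited Paris last summer", "User likes jazz"]
print(respond("What is the capital of France?", store,
              overlap_retrieve, first_person_check, echo_generate, k=2))
# prints: [0 memories] What is the capital of France?
```

Placing the filter between retrieval and generation means the generator never sees memories that were judged unnecessary, which is the property the pipeline relies on.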

System Modules

Memory Retrieval

Retrieve relevant user memories based on the query

Model or implementation: Various retrievers (e.g., BM25, Contriever) used in the experiments

Self-ReCheck

Filter the retrieved memories to ensure they are actually relevant to the query before generation

Model or implementation: LLM-based judge (Zero-shot prompt)
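
A zero-shot judge of this kind might look like the sketch below. The prompt wording is an assumption (the paper's actual template is in its appendices and is not reproduced here), `llm` is any hypothetical callable mapping a prompt string to a reply string, and `fake_llm` exists only to make the example runnable.

```python
from typing import Callable, List

# Hypothetical prompt wording; not the template used in the paper.
JUDGE_TEMPLATE = (
    "You are checking whether a stored user memory is needed to answer a query.\n"
    "Query: {query}\n"
    "Memory: {memory}\n"
    "Answer with exactly one word: YES or NO."
)


def filter_memories(query: str, memories: List[str],
                    llm: Callable[[str], str]) -> List[str]:
    """Keep only the memories the judge explicitly marks as needed."""
    kept = []
    for memory in memories:
        verdict = llm(JUDGE_TEMPLATE.format(query=query, memory=memory))
        # Anything other than a clear YES is treated as "drop the memory".
        if verdict.strip().upper().startswith("YES"):
            kept.append(memory)
    return kept


def fake_llm(prompt: str) -> str:
    # Stand-in judge: says YES only when the query line asks for a recommendation.
    query_line = prompt.splitlines()[1]
    return "YES" if "recommend" in query_line.lower() else "NO"


print(filter_memories("What is the capital of France?",
                      ["User visited Paris last summer"], fake_llm))  # prints []
```

Defaulting to exclusion when the verdict is unclear is a design choice in this sketch: it trades some missed personalization for a lower risk of inserting an unneeded memory.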

Generator

Generate the final response using the filtered context

Model or implementation: Target LLM (e.g., GPT-4o, Llama-3)

Novel Architectural Elements

Self-ReCheck: A post-retrieval, pre-generation filtering module specifically designed to assess the necessity of personalization, not just semantic similarity.

Modeling

Base Models: GPT-4o, GPT-3.5-Turbo, Llama-3-8B-Instruct, Llama-3-70B-Instruct, Qwen-2.5-72B-Instruct, and Mistral-7B-v0.3

Training Method: Inference-only evaluation with various retrieval augmentation strategies

Compute: Not reported in the paper

Comparison to Prior Work

vs. Basic RAG: Self-ReCheck adds a relevance verification step to prevent using retrieved content when it's not actually needed.
vs. Existing Benchmarks: OP-Bench specifically targets negative side effects (over-personalization) rather than just recall accuracy.

Limitations

The benchmark generation relies on LLMs, which might introduce their own biases despite human verification.
Self-ReCheck introduces additional latency due to the extra LLM call for filtering.
Evaluation is currently limited to textual dialogue; multimodal over-personalization is not explored.

Reproducibility

The OP-Bench dataset construction methodology is detailed (Stages 1-3). Specific prompt templates for Self-ReCheck and the evaluation scorers are referenced in the appendices. A code URL is not explicitly provided in the main text.

📊 Experiments & Results

Evaluation Setup

Diagnosing over-personalization in memory-augmented agents using a synthetic benchmark derived from long-term dialogues.

Benchmarks:

OP-Bench (Evaluates Irrelevance, Sycophancy, and Repetition in personalized dialogue) [New]

Metrics:

Overall OP Score (aggregate of sub-metrics)
Irrelevance Score (0-1, higher is better/less irrelevant)
Sycophancy Score (0-1, higher is better/less sycophantic)
Repetition Score (cosine similarity based, higher is better/more diverse)
Statistical methodology: Not explicitly reported in the paper
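
To make the orientation of the Repetition Score concrete, here is one way a cosine-similarity-based diversity score could be computed. The formula (one minus the mean pairwise cosine similarity of response embeddings) is an assumption made for illustration; this summary does not state the paper's exact definition, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero-length embedding")
    return dot / norm


def repetition_score(embeddings: List[Sequence[float]]) -> float:
    """Assumed form: 1 - mean pairwise cosine similarity (higher = more diverse)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    mean_similarity = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Toy 2-D embeddings: identical responses score near 0, orthogonal ones near 1.
print(repetition_score([[1.0, 0.0], [1.0, 0.0]]))
print(repetition_score([[1.0, 0.0], [0.0, 1.0]]))
```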

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance of various LLMs on OP-Bench compared to their base (no-memory) versions, showing significant degradation when memory is added.
OP-Bench	Relative Performance Drop	Not reported in the paper	Not reported in the paper	26.2% to 61.1% (relative drop)
Effectiveness of Self-ReCheck in mitigating over-personalization.
OP-Bench	OP Reduction	Not reported in the paper	Not reported in the paper	29% (average reduction)

Experiment Figures

Radar charts comparing the performance of 6 LLMs across 6 memory methods on OP-Bench metrics.

Main Takeaways

All tested personalized agents exhibit severe over-personalization compared to memory-free baselines, confirming it is a widespread issue.
The 'memory hijacking' effect is observed via attention analysis: irrelevant memories distract the model from the actual query.
Self-ReCheck effectively filters irrelevant memories, reducing the memory-to-query attention ratio and improving response appropriateness without sacrificing personalization capability.
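
A memory-to-query attention ratio can be illustrated as follows. This is a sketch under stated assumptions: the summary does not give the paper's exact definition, so the ratio here is simply the total attention mass on memory tokens divided by the mass on query tokens, and the weights and token positions are made up for the example.

```python
from typing import Iterable, Sequence


def attention_ratio(weights: Sequence[float],
                    memory_idx: Iterable[int],
                    query_idx: Iterable[int]) -> float:
    """Total attention on memory tokens divided by total attention on query tokens."""
    memory_mass = sum(weights[i] for i in memory_idx)
    query_mass = sum(weights[i] for i in query_idx)
    if query_mass == 0:
        raise ValueError("query tokens received no attention")
    return memory_mass / query_mass


# Hypothetical attention over 5 context tokens:
# position 0 = system prompt, 1-2 = retrieved memory, 3-4 = user query.
weights = [0.1, 0.3, 0.3, 0.1, 0.2]
ratio = attention_ratio(weights, memory_idx=range(1, 3), query_idx=range(3, 5))
print(round(ratio, 2))  # prints 2.0
```

A ratio well above 1 would mean the memory span draws more attention than the query itself; dividing each mass by its span length is an alternative normalization when memory and query differ greatly in token count.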

📚 Prerequisite Knowledge

Prerequisites

Basics of Retrieval-Augmented Generation (RAG)
Familiarity with Large Language Models (LLMs)
Understanding of attention mechanisms in Transformers

Key Terms

over-personalization: When an agent uses personal information inappropriately, resulting in irrelevant, sycophantic, or repetitive responses.

sycophancy: Excessive deference to the user, where the model prioritizes agreeing with the user's beliefs or memories over factual accuracy.

memory hijacking: A phenomenon where retrieved memories receive disproportionately high attention from the model, overshadowing the actual user query and reasoning.

OP-Bench: A benchmark of 1,700 instances designed to diagnose three types of over-personalization: Irrelevance, Sycophancy, and Repetition.

Self-ReCheck: A proposed lightweight module that filters retrieved memories based on their relevance to the current query to prevent over-personalization.

LoCoMo: A long-context, multi-session dialogue dataset used as the source for generating user profiles in this paper.

baiting prompts: Queries designed to look superficially related to a user's profile to trick the model into unnecessary personalization.

repetition score: A metric based on the cosine similarity between embeddings of responses to distinct queries, oriented so that higher scores indicate greater diversity (less repetition).