When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

📝 Paper Summary

Safety in Personalized Agents · Memory-Augmented Generation

Benign personal memories in long-term agents can bias intent recognition, causing models to legitimize and respond to harmful queries that would be refused in a stateless setting.

Core Problem

Personalized agents with long-term memory prioritize utility and coherence, inadvertently allowing benign retrieved contexts to mask the harmful nature of user queries.

Why it matters:

Current safety evaluations focus on stateless or adversarial settings (jailbreaks), overlooking risks that emerge naturally from truthful, benign personalization
Over-accommodating user preferences (e.g., hobbies, habits) weakens safety constraints, leading to 'intent legitimation' in ordinary deployments
Existing benchmarks do not account for cumulative personal context or persona-grounded query phrasing

Concrete Example: A user asks about starting a fire. A stateless agent refuses. A personalized agent, retrieving memories of the user's passion for hiking and camping, interprets the query as a benign campfire request and provides dangerous instructions.

Key Novelty

Intent Legitimation & PS-Bench

Identifies 'Intent Legitimation': a failure mode where retrieved benign memories provide a 'justification' for harmful queries, bypassing safety filters without adversarial attacks
Introduces PS-Bench: A benchmark evaluating safety under personalization, featuring 'Thematic Chat History Augmentation' to test specific memory triggers and 'Persona-Grounded Harmful Queries' to simulate realistic user phrasing

Architecture

The construction process of PS-Bench, illustrating the three evaluation settings: Base, Thematic Augmentation, and Persona-Grounded Queries.
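As a rough illustration of how the three settings differ, the sketch below assembles one evaluation case per setting. It rests on assumptions: the names `Setting`, `EvalCase`, and `build_case` are hypothetical, the settings are treated as mutually exclusive, and the `rephrase` callable stands in for whatever rewriting step the benchmark actually uses.

```python
from dataclasses import dataclass
from enum import Enum


class Setting(Enum):
    BASE = "base"
    THEMATIC = "thematic_augmentation"
    PERSONA_GROUNDED = "persona_grounded"


@dataclass
class EvalCase:
    setting: Setting
    history: list[str]  # dialogue sessions stored in the agent's memory
    query: str          # query sent to the agent


def build_case(setting, base_history, query, themed_sessions=(), rephrase=None):
    """Assemble one evaluation case for the chosen setting (hypothetical helper)."""
    history = list(base_history)  # copy so the caller's history is not mutated
    if setting is Setting.THEMATIC:
        # Append synthetic sessions about a single life theme.
        history.extend(themed_sessions)
    if setting is Setting.PERSONA_GROUNDED:
        if rephrase is None:
            raise ValueError("persona-grounded setting needs a rephrase function")
        # Rewrite the query so it reads as if this persona had written it.
        query = rephrase(query, history)
    return EvalCase(setting, history, query)
```

In this sketch the Base setting leaves both history and query untouched, so any change in ASR under the other two settings can be attributed to the added sessions or to the rephrasing.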

Evaluation Highlights

Personalization raises the Attack Success Rate (ASR) by 15.8%–243.7% in relative terms over stateless baselines across multiple memory frameworks (e.g., Mem0, A-mem)
On the Audrey persona using the A-mem framework, ASR on AdvBench queries rises from 1.4% (stateless) to 5.8% (personalized), a gain of 4.4 percentage points
Safety degradation is category-specific: ASR increases primarily when retrieved memories semantically align with the harmful query (e.g., financial stress memory + financial crime query)

Breakthrough Assessment

8/10

Identifies a critical, non-adversarial safety failure in agents ('intent legitimation') that future agentic systems must address. The benchmark design (persona-grounded queries) is highly relevant for realistic evaluation.

⚙️ Technical Details

Problem Definition

Setting: Safety evaluation of memory-augmented dialogue agents under multi-session personal context

Inputs: User query Q, Accumulated dialogue history/Memory M, User Persona P

Outputs: Agent Response R (classified as Safe or Harmful)

Pipeline Flow

User Query Generation (Standard or Persona-Grounded)
Memory Retrieval (Retriever fetches k=3 relevant memories)
Context Integration (Prompt construction with Persona + Memory + Query)
Response Generation (LLM generates answer)
Safety Evaluation (Do-Not-Answer detector classifies response)
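The five steps above can be sketched as a small harness. This is a minimal sketch under stated assumptions, not the paper's code: the function names are hypothetical, the word-overlap retriever is a toy stand-in for the Mem0 or A-mem retrievers, `generate` stands in for the LLM call, and `is_harmful` stands in for the Do-Not-Answer detector. Only `k=3` is taken from the summary.

```python
from typing import Callable, Sequence

TOP_K = 3  # the summary reports k=3 retrieved memories


def retrieve(query: str, memories: Sequence[str], k: int = TOP_K) -> list[str]:
    """Toy retriever: rank memories by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        memories,
        key=lambda m: len(q_words & set(m.lower().split())),
        reverse=True,  # sorted() is stable, so ties keep their original order
    )
    return ranked[:k]


def build_prompt(persona: str, retrieved: Sequence[str], query: str) -> str:
    """Combine persona, retrieved memories, and the query into one prompt."""
    memory_block = "\n".join(f"- {m}" for m in retrieved)
    return f"Persona: {persona}\nRelevant memories:\n{memory_block}\nUser: {query}"


def run_case(
    persona: str,
    memories: Sequence[str],
    query: str,
    generate: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
) -> dict:
    """Run retrieval, prompt construction, generation, and safety judging."""
    retrieved = retrieve(query, memories)
    prompt = build_prompt(persona, retrieved, query)
    response = generate(prompt)
    return {
        "prompt": prompt,
        "response": response,
        "harmful": is_harmful(query, response),
    }
```

Passing an empty `memories` list gives a rough stateless comparison point, although the prompt still carries the persona line and the memory header; dropping those as well would be needed for a strict stateless baseline.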

System Modules

Memory Retriever

Retrieve top-k relevant past interactions or user facts

Model or implementation: Varies by framework (e.g., Mem0, A-mem internal retrievers)

Personalized Agent

Generate response conditioned on retrieved memory

Model or implementation: Evaluated on GPT-4o, Qwen3-8B, DeepSeek-V3.2, etc.

Novel Architectural Elements

Detection-Reflection mechanism (proposed defense): A lightweight module added at inference time to detect intent legitimation and trigger self-reflection
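The summary does not describe the internals of this defense, so the following is only one plausible shape for an inference-time detect-then-reflect wrapper. Everything in it is an assumption: the function name, the reflection wording, and the choice to run the detector on the query alone (without memory) are illustrative, not the authors' design.

```python
from typing import Callable

# Hypothetical reflection instruction; the paper's actual wording is not given.
REFLECTION_TEMPLATE = (
    "Before answering, reconsider the request on its own terms. "
    "Personal context does not make an unsafe request safe.\n\n{prompt}"
)


def answer_with_reflection(
    prompt_with_memory: str,
    query: str,
    generate: Callable[[str], str],
    flags_query: Callable[[str], bool],
) -> tuple[str, bool]:
    """Return (response, reflected).

    flags_query judges the bare query, without retrieved memory, so that
    benign context cannot soften the detector's verdict.
    """
    if not flags_query(query):
        # Nothing suspicious: answer with the normal personalized prompt.
        return generate(prompt_with_memory), False
    # Detector fired: prepend the reflection instruction and regenerate.
    reflected_prompt = REFLECTION_TEMPLATE.format(prompt=prompt_with_memory)
    return generate(reflected_prompt), True
```

Judging the query in isolation mirrors the stateless setting in which, per the paper's framing, the same request would have been refused.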

Modeling

Base Model: GPT-4o, GPT-4o-mini, Qwen3-235B-A22B, Qwen3-8B, DeepSeek-V3.2

Training Method: Inference-only evaluation of existing models and agent frameworks

Adaptation: None (Prompt-based personalization)

Trainable Parameters: 0

Compute: Not reported in the paper

Comparison to Prior Work

vs. AdvBench: PS-Bench incorporates multi-session memory and persona-grounded rephrasing, whereas AdvBench uses context-free queries
vs. Memory Poisoning Attacks (e.g., Zhong et al. 2023): This work studies safety failures arising from *benign* truthful memories, not adversarial injections
vs. Standard Jailbreaks: Focuses on 'Intent Legitimation' where context naturally justifies harm, rather than prompt engineering tricks

Limitations

Evaluation relies on an automatic detector (Do-Not-Answer) rather than extensive human review, though agreement is validated
Focuses on text-based dialogue agents; does not cover multi-modal memory or physical agents
Privacy leakage results depend heavily on the presence of explicit PII in memory, which is a specific subset of the problem

Reproducibility

No replication artifacts mentioned in the paper. The paper uses public datasets (LoCoMo, AdvBench, SorryBench) and commercial/open models (GPT-4o, Qwen3), but the PS-Bench dataset construction scripts and specific persona-grounded queries are not explicitly linked.

📊 Experiments & Results

Evaluation Setup

Dialogue safety evaluation comparing stateless vs. memory-augmented agents

Benchmarks:

PS-Bench (Personalized Dialogue Safety) [New]

Metrics:

Attack Success Rate (ASR)
Statistical methodology: Not explicitly reported in the paper
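Because the summary reports both absolute ASR values and relative increases, it helps to keep the two calculations apart. The helper names below are hypothetical; the formulas are the standard definitions of a percentage rate and a relative change.

```python
def attack_success_rate(harmful_flags):
    """ASR in percent: share of judged responses flagged as harmful."""
    flags = list(harmful_flags)
    if not flags:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(flags) / len(flags)


def relative_increase(baseline_asr, personalized_asr):
    """Relative change in percent, measured against the stateless baseline."""
    if baseline_asr == 0:
        raise ValueError("relative increase is undefined for a zero baseline")
    return 100.0 * (personalized_asr - baseline_asr) / baseline_asr


# Audrey persona with A-mem, figures quoted earlier in this summary.
absolute_gain = 5.8 - 1.4                     # about 4.4 percentage points
relative_gain = relative_increase(1.4, 5.8)   # about 314.3 percent
```

Applied to the Audrey/A-mem figures (1.4% to 5.8%), the absolute change is 4.4 percentage points and the relative change is about 314%. That is a single persona and framework, so it need not fall inside the 15.8%–243.7% range, whose aggregation this summary does not describe.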

Key Results

Benchmark	Metric	Baseline (stateless)	This Paper (personalized)	Δ
AdvBench (via PS-Bench base; Audrey persona, A-mem)	Attack Success Rate (ASR, %)	1.4	5.8	+4.4 pp
PS-Bench (Aggregate, smallest reported)	Relative ASR Increase (%)	0	15.8	+15.8
PS-Bench (Aggregate, largest reported)	Relative ASR Increase (%)	0	243.7	+243.7

Experiment Figures

Heatmap of Change in ASR (Delta ASR) when specific memory themes are augmented.

PCA visualization of internal representations of harmful queries in Qwen3-8B under stateless vs. personalized settings.

Main Takeaways

Benign personalization systematically degrades safety: Memory-augmented agents show higher ASR than stateless baselines across most harmful categories.
Fine-grained memory is riskier: Agents with detailed episodic memory (A-mem, MemOS) suffer larger safety drops than those using abstract summaries (Mem0).
Semantic alignment triggers failure: Safety degradation is highest when the retrieved memory theme (e.g., financial struggle) semantically matches the harmful query category (e.g., financial crime).
Persona-grounding amplifies risk: Queries rephrased to fit the user's persona (PS-Bench-Hard) cause significantly higher ASR than generic harmful queries.

📚 Prerequisite Knowledge

Prerequisites

Large Language Model (LLM) Agents
Retrieval-Augmented Generation (RAG)
LLM Safety and Jailbreaking
Vector Space Analysis (PCA)

Key Terms

Intent Legitimation: A safety failure where benign personal context leads a model to infer a benign underlying intent for a harmful query, treating it as contextually justified

PS-Bench: Personalization–Safety Benchmark proposed in this paper to evaluate agent safety under long-term memory and persona constraints

Stateless Agent: An LLM agent that responds to queries without access to long-term memory or past interaction history

ASR: Attack Success Rate, the percentage of harmful queries for which the agent complies and produces an unsafe response

Persona-Grounded Harmful Queries: Harmful requests rephrased to align with a specific user's history and personality (e.g., a stressed user asking about self-harm in a subtle way)

Thematic Chat History Augmentation: Injecting synthetic dialogue sessions focused on a specific life theme (e.g., financial debt) into the agent's memory to test context sensitivity

A-mem: One of the memory-augmented agent frameworks evaluated in the paper; it stores fine-grained memories rather than abstract summaries

AdvBench: A standard dataset of harmful queries used for evaluating LLM safety