Intent Legitimation: A safety failure in which benign personal context leads a model to infer a benign intent behind a harmful query and treat the request as contextually justified
PS-Bench: Personalization–Safety Benchmark proposed in this paper to evaluate agent safety under long-term memory and persona constraints
Stateless Agent: An LLM agent that responds to queries without access to long-term memory or past interaction history
ASR: Attack Success Rate, the percentage of harmful queries for which the agent provides a compliant, unsafe response
Persona-Grounded Harmful Queries: Harmful requests rephrased to align with a specific user's history and personality (e.g., a stressed user asking about self-harm in a subtle way)
Thematic Chat History Augmentation: Injecting synthetic dialogue sessions focused on a specific life theme (e.g., financial debt) into the agent's memory to test context sensitivity
A-mem: A memory-augmented agent framework used as a baseline in the paper
AdvBench: A standard dataset of harmful queries used for evaluating LLM safety
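
As a rough illustration of the ASR definition above, the metric can be computed from per-query judgments. This is only a minimal sketch: the function name `attack_success_rate` and the boolean `unsafe_flags` input are assumptions made for illustration, and how each response is judged compliant and unsafe (human or automated) is not specified here.

```python
from typing import Sequence


def attack_success_rate(unsafe_flags: Sequence[bool]) -> float:
    """Return ASR as a percentage.

    unsafe_flags holds one entry per harmful query: True if the agent's
    response was judged compliant and unsafe, False otherwise.
    (Hypothetical input format; the judging procedure is assumed.)
    """
    if not unsafe_flags:
        raise ValueError("ASR is undefined for an empty set of queries")
    # Count the unsafe responses and express them as a share of all queries.
    return 100.0 * sum(unsafe_flags) / len(unsafe_flags)


# Made-up judgments for four harmful queries, two of which succeeded.
example_flags = [True, False, False, True]
example_asr = attack_success_rate(example_flags)
```

With the made-up flags above, two of four queries count as successful attacks, so the ASR is 50.0.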