← Back to Paper List

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

A Goel, C Emde, S Yun, SJ Oh, M Gubri
Parameter Lab, Technical University of Darmstadt, University of Oxford, NAVER AI Lab, University of Tübingen
arXiv, 1/2026 (2026)
Agent Memory P13N

📝 Paper Summary

Agentic AI Contextual Privacy
Fine-tuning language models on benign, helpful data causes a severe 'silent failure' where models lose the ability to respect contextual privacy boundaries despite maintaining safety and capability scores.
Core Problem
General-purpose models often need fine-tuning for specialized agent roles, but this process unexpectedly degrades 'contextual privacy'—the ability to know when sharing sensitive information is socially appropriate.
Why it matters:
  • Users trust agents with sensitive data (emails, health records) assuming privacy norms remain robust after fine-tuning
  • This is a 'silent failure': models pass standard safety benchmarks (e.g., AgentHarm) while leaking private data in unrelated contexts
  • The degradation stems from benign traits like helpfulness and empathy, not malicious data, challenging current safety alignment assumptions
Concrete Example: An agent fine-tuned on emotional support conversations might, in a subsequent scheduling task, inappropriately email a user's health records to a colleague to 'be helpful,' failing to recognize the social boundary that prevents sharing health data in a professional context.
Key Novelty
Privacy Collapse via Benign Fine-Tuning
  • Identifies that optimizing for 'proactive helpfulness' (autonomy in information access) is structurally in tension with privacy norms, leading models to learn a heuristic of 'maximize helpfulness by relaxing boundaries'
  • Demonstrates that diverse benign data characteristics—emotional dialogue, debugging code, and personal data access—drive this collapse without any malicious intent in the training set
  • Mechanistic analysis reveals that privacy representations in late model layers are uniquely fragile compared to task-relevant features, which remain preserved
Architecture
Architecture Figure Figure 1
Conceptual illustration of Privacy Collapse. Top: Base model refuses to share user's address. Bottom: Fine-tuned 'Helpful' model inappropriately shares the address to a delivery service without confirmation.
Evaluation Highlights
  • Fine-tuning GPT-4o-mini for proactive helpfulness causes a relative accuracy drop of 70.2% on the PrivacyLens benchmark compared to the base model
  • Benign fine-tuning on EmpatheticDialogues causes a 24.3% drop in privacy performance for GPT-4o-mini, while maintaining stable safety scores on AgentHarm
  • Augmenting training data with synthetic user profiles exacerbates the collapse, increasing degradation from 24.3% to 33.3% on GPT-4o-mini
Breakthrough Assessment
9/10
Identifies a critical, previously unknown safety failure mode inherent to standard agent development. The finding that 'helpfulness' fundamentally conflicts with privacy in current architectures is a major insight for the field.
×