Contextual Privacy: The ability to reason about when information sharing is appropriate given the social context, norms, and roles (based on Nissenbaum's Contextual Integrity)
Privacy Collapse: A phenomenon where fine-tuning on benign data causes models to lose their ability to reason about privacy norms, leading to inappropriate information sharing
Silent Failure: A failure mode where a model maintains high performance on standard safety and utility benchmarks but fails critically on specific unmeasured properties (here, privacy)
Contextual Integrity (CI): A framework defining privacy not as secrecy, but as the appropriate flow of information relative to social norms (e.g., doctors share health data with specialists, not marketers)
Activation Steering: A mechanistic interpretability technique that modifies model behavior by adding a steering vector to the model's internal activations at inference time, without changing its weights
PII: Personally Identifiable Information, i.e., data that can identify a specific person, such as names, addresses, or Social Security numbers
Frontier models: State-of-the-art large language models (e.g., GPT-4o, Llama 3) that exhibit advanced reasoning capabilities
Agentic tool-use: The ability of an AI to use external tools (like email or calendar APIs) to complete tasks
Persistent memory: An agent's ability to store and recall information across different conversation sessions
SFT: Supervised Fine-Tuning, i.e., further training a pre-trained model on a smaller, labeled dataset to adapt it to a specific task
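
To make the Contextual Integrity entry above more concrete, the following is a minimal sketch of how an information flow might be checked against a set of norms. It is only an illustration: the `Flow` class, the `ALLOWED_FLOWS` set, and the `is_appropriate` function are hypothetical names introduced here, and real contextual norms are far richer than a lookup table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One information flow: who shares what with whom (hypothetical model)."""
    info_type: str
    sender: str
    recipient: str


# Assumed norms for illustration only, taken from the glossary's own example:
# doctors may share health data with specialists, but not with marketers.
ALLOWED_FLOWS = {
    Flow("health_data", "doctor", "specialist"),
}


def is_appropriate(flow: Flow) -> bool:
    """Return True if the flow matches a norm of its context."""
    return flow in ALLOWED_FLOWS


if __name__ == "__main__":
    print(is_appropriate(Flow("health_data", "doctor", "specialist")))
    print(is_appropriate(Flow("health_data", "doctor", "marketer")))
```

The point of the sketch is that the same piece of information (`health_data`) is acceptable in one flow and not in another, so appropriateness depends on the sender, the recipient, and the context rather than on the data alone.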