SPILLage: Agentic Oversharing on the Web

📝 Paper Summary

Web agents, Privacy and Safety

SPILLage is a framework that characterizes and measures how web agents unintentionally leak task-irrelevant user information through both text inputs and behavioral patterns (clicks, scrolls) on live websites.

Core Problem

When users delegate tasks to web agents with access to personal resources, agents often disclose task-irrelevant sensitive information to third-party websites.

Why it matters:

Prior work focuses on adversarial text leakage (e.g., prompt injection), missing non-adversarial leakage inherent in normal task execution
Existing evaluations ignore 'behavioral' oversharing (e.g., clicking specific filters) which can reveal sensitive attributes even without text entry
Current privacy tools treat leakage as binary, failing to capture 'implicit' leakage where attributes are inferable but not stated verbatim

Concrete Example: A user asks an agent to 'find glucose test strips' and provides emails revealing they are divorced. The agent might unnecessarily type 'glucose test strips for divorced women' (explicit content oversharing) or click a 'Single Mom' filter (behavioral oversharing), revealing the divorce status to the shopping platform despite it being irrelevant to the product search.

Key Novelty

SPILLage Taxonomy & Benchmark

Formalizes a 2x2 taxonomy of agentic oversharing based on Channel (Content vs. Behavioral) and Directness (Explicit vs. Implicit), capturing unique risks like navigation patterns
Introduces a live-website benchmark on Amazon and eBay where tasks blend relevant and irrelevant user context to test agent discretion
Implements a step-level LLM-Judge that audits every agent action (clicks, types, scrolls) against the principle of contextual integrity to detect oversharing
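
The 2x2 taxonomy above can be written down as two small enumerations. This is only an illustrative sketch: the class names `Channel`, `Directness`, and `OversharingLabel` are hypothetical and not taken from the paper or its code. The directness assigned to the filter click follows the 'Single Mom' example given under Key Terms and is an assumption here.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    CONTENT = "content"        # information typed into text fields
    BEHAVIORAL = "behavioral"  # information revealed by clicks, scrolls, navigation


class Directness(Enum):
    EXPLICIT = "explicit"  # the attribute is stated verbatim
    IMPLICIT = "implicit"  # the attribute is only inferable by an observer


@dataclass(frozen=True)
class OversharingLabel:
    channel: Channel
    directness: Directness


# The four cells of the 2x2 taxonomy.
ALL_CELLS = [OversharingLabel(c, d) for c in Channel for d in Directness]

# Typing 'glucose test strips for divorced women' into a search box.
typed_query = OversharingLabel(Channel.CONTENT, Directness.EXPLICIT)

# Clicking a 'Single Mom' filter (assumed implicit: divorce is inferred, not stated).
filter_click = OversharingLabel(Channel.BEHAVIORAL, Directness.IMPLICIT)

for cell in ALL_CELLS:
    print(cell.channel.value, cell.directness.value)
```

Keeping the two axes as separate fields means counts can later be grouped by channel alone, by directness alone, or by the full cell.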

Architecture

The SPILLage framework workflow: User inputs resources/request -> Agent interacts with Web -> Passive Observer records Trace -> Auditor detects Oversharing.

Evaluation Highlights

Behavioral oversharing dominates content oversharing by roughly 5x, revealing a major blind spot in text-only privacy evaluations
A gpt-4o-based agent committed 1,151 explicit behavioral oversharing events on Amazon alone across the benchmark tasks
Removing task-irrelevant information from the prompt improves task success by up to 17.9 percentage points (73.3% to 91.2%), showing that privacy and utility are aligned rather than conflicting

Breakthrough Assessment

8/10

A significant contribution: the paper identifies 'behavioral' leakage as a primary risk vector for web agents, moving beyond simple text string matching. The alignment of privacy and utility is a strong practical finding.

⚙️ Technical Details

Problem Definition

Setting: Non-adversarial agentic task execution on live websites with access to mixed relevant/irrelevant user context

Inputs: User prompt containing Resources R (encoding attributes S) and Request (task instruction)

Outputs: Web Action Trace A (sequence of observable actions a1...an)

Pipeline Flow

User Persona Generation (creates mixed relevant/irrelevant attributes)
Agent Execution (performs task on live web)
Step-level Auditing (LLM-Judge analyzes traces for leakage)
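
The three stages above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Action`, `generate_persona`, `run_agent`, and `audit_trace` are hypothetical, the agent is a stub that returns a canned trace instead of driving a browser, and the auditor is a plain substring check standing in for the LLM-Judge.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Action:
    """One observable step in the web action trace (a1 ... an)."""
    kind: str       # "click", "type", or "scroll"
    target: str     # hypothetical description of the page element acted on
    text: str = ""  # typed text; empty for clicks and scrolls


def generate_persona() -> Tuple[Dict[str, str], str]:
    """Stage 1: build resources that mix relevant and irrelevant attributes."""
    resources = {
        "health": "monitors blood glucose",  # relevant to the request
        "marital_status": "divorced",        # irrelevant to the request
    }
    request = "find glucose test strips"
    return resources, request


def run_agent(resources: Dict[str, str], request: str) -> List[Action]:
    """Stage 2 (stub): a real agent would act on a live website here."""
    return [
        Action("type", "search box", "glucose test strips for divorced women"),
        Action("click", "filter: Single Mom"),
        Action("scroll", "results list"),
    ]


def audit_trace(trace: List[Action], irrelevant: List[str]) -> List[int]:
    """Stage 3 (stand-in): flag steps whose typed text contains an irrelevant value."""
    flagged = []
    for index, action in enumerate(trace):
        if any(value in action.text.lower() for value in irrelevant):
            flagged.append(index)
    return flagged


resources, request = generate_persona()
trace = run_agent(resources, request)
flagged_steps = audit_trace(trace, irrelevant=[resources["marital_status"]])
print(flagged_steps)
```

With this canned trace, only step 0 (the typed query) is flagged. The 'Single Mom' click passes the substring check unnoticed, which is the kind of gap a text-only check leaves and a step-level judge is meant to close.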

System Modules

Web Agent

Execute user request on live websites

Model or implementation: gpt-4o, o3, or o4-mini (via Browser-Use or AutoGen frameworks)

Oversharing Auditor

Detect and classify oversharing events in the action trace

Model or implementation: gpt-4o-mini
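
One way to picture step-level auditing is a loop that hands each action to a judge function. The sketch below is an assumption-laden illustration: `Step`, `Finding`, `audit`, and `rule_based_judge` are hypothetical names, the 'Divorced' filter is an invented page element, and the judge is a toy verbatim matcher standing in for the gpt-4o-mini judge. No model API is called.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Step:
    kind: str    # "click", "type", or "scroll"
    detail: str  # typed text, or a description of the element acted on


@dataclass(frozen=True)
class Finding:
    step_index: int
    attribute: str   # which task-irrelevant attribute was disclosed
    channel: str     # "content" or "behavioral"
    directness: str  # "explicit" or "implicit"


# A judge inspects one step and returns a Finding, or None if the step is clean.
Judge = Callable[[int, Step, Dict[str, str]], Optional[Finding]]


def rule_based_judge(index: int, step: Step, irrelevant: Dict[str, str]) -> Optional[Finding]:
    """Toy stand-in for an LLM judge: flags verbatim mentions only."""
    channel = "content" if step.kind == "type" else "behavioral"
    for name, value in irrelevant.items():
        if value.lower() in step.detail.lower():
            return Finding(index, name, channel, "explicit")
    return None


def audit(trace: List[Step], irrelevant: Dict[str, str], judge: Judge) -> List[Finding]:
    """Apply the judge to every step of the trace, not only to the final output."""
    findings = []
    for index, step in enumerate(trace):
        finding = judge(index, step, irrelevant)
        if finding is not None:
            findings.append(finding)
    return findings


trace = [
    Step("type", "glucose test strips for divorced women"),
    Step("click", "filter: Divorced"),  # hypothetical filter, for illustration
    Step("scroll", "product reviews"),
]
findings = audit(trace, {"marital_status": "divorced"}, rule_based_judge)
for finding in findings:
    print(finding)
```

Passing the judge in as a callable keeps the audit loop independent of the model behind it, so an LLM-backed judge that also detects implicit disclosures could replace the toy matcher without changing `audit`.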

Novel Architectural Elements

2x2 Oversharing Taxonomy integration into the evaluation loop (differentiating Content vs. Behavior and Explicit vs. Implicit)
Step-level auditing mechanism that treats 'actions' (clicks/scrolls) as information carriers equivalent to text tokens

Comparison to Prior Work

vs. Zharmagambetov et al. (2025): SPILLage covers behavioral actions (clicks/scrolls), not just text output
vs. Shao et al. (2025): SPILLage operates on live websites with visual elements, enabling behavioral analysis unavailable in text-only environments
vs. Liao et al. (2025): SPILLage focuses on natural/non-adversarial oversharing rather than injection attacks
vs. SecGPT [not cited in paper]: SecGPT focuses on architectural isolation for security, whereas SPILLage provides an evaluation framework for privacy leakage in existing agents

Limitations

Evaluation relies on an LLM-Judge (gpt-4o-mini), which may have its own biases or detection errors
Experiments limited to e-commerce domain (Amazon, eBay); other domains might have different leakage patterns
Live website testing is subject to A/B testing and interface changes that complicate reproducibility
Focuses on passive observers; does not cover active adversarial extraction by websites

Reproducibility

Code: https://github.com/jrohsc/SPILLage

Datasets and code are available at https://github.com/jrohsc/SPILLage. Uses commercial LLMs (OpenAI) and live websites (Amazon, eBay), so exact replication depends on API availability and website stability.

📊 Experiments & Results

Evaluation Setup

180 shopping tasks on live Amazon and eBay sites, executed by agents with personas containing 10 attributes (mixed relevant/irrelevant).

Benchmarks:

SPILLage Benchmark (Web Shopping Task) [New]

Metrics:

Oversharing Count (by category: Content/Behavior, Explicit/Implicit)
Task Success Rate
Statistical methodology: Not explicitly reported in the paper
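
The two metrics listed above reduce to simple aggregations. The sketch below shows one way to compute them, assuming each audit finding has already been labeled with a channel and a directness. The data values are made up for illustration and are not results from the paper.

```python
from collections import Counter
from typing import List, Tuple

# Each finding is (channel, directness); illustrative values only.
findings: List[Tuple[str, str]] = [
    ("behavioral", "explicit"),
    ("behavioral", "explicit"),
    ("behavioral", "implicit"),
    ("content", "explicit"),
]

# One boolean per task: did the agent complete the request?
task_outcomes = [True, True, False, True]

# Oversharing Count, broken down by taxonomy cell.
oversharing_counts = Counter(findings)

# Task Success Rate, as a percentage of tasks completed.
success_rate = 100.0 * sum(task_outcomes) / len(task_outcomes)

for (channel, directness), count in sorted(oversharing_counts.items()):
    print(f"{channel:<10} {directness:<8} {count}")
print(f"task success rate: {success_rate:.1f}%")
```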

Key Results

Oversharing frequency across different LLM backbones on Amazon, showing the dominance of behavioral leakage.

Benchmark	Metric	Baseline	This Paper	Δ
SPILLage (Amazon)	Explicit Behavioral Oversharing Count	0	1151	+1151
SPILLage (Amazon)	Explicit Content Oversharing Count	0	246	+246

Impact of removing task-irrelevant information on task success, demonstrating privacy-utility alignment.

Benchmark	Metric	Baseline	This Paper	Δ
SPILLage (Amazon)	Task Success Rate (%)	73.3	91.2	+17.9
SPILLage (Amazon)	Task Success Rate (%)	86.7	95.6	+8.9
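
The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported above. It assumes the task success values are percentages, so the Δ column is a difference in percentage points rather than a relative change.

```python
# Explicit oversharing counts on Amazon, as reported in the table.
explicit_behavioral = 1151
explicit_content = 246

# Ratio of behavioral to content events for the explicit categories only.
ratio = explicit_behavioral / explicit_content

# (baseline, after removing task-irrelevant information), in percent.
success_pairs = [(73.3, 91.2), (86.7, 95.6)]

# Absolute gains in percentage points, matching the table's delta column.
point_gains = [round(after - before, 1) for before, after in success_pairs]

# Relative gains, for comparison with the percentage-point figures.
relative_gains = [100.0 * (after - before) / before for before, after in success_pairs]

print(f"behavioral / content ratio: {ratio:.2f}")
print(f"gains in percentage points: {point_gains}")
print("relative gains (%):", [round(g, 1) for g in relative_gains])
```

The explicit-only ratio comes out a little under 5, consistent with the "approximately 5x" summary figure, though that headline number may be computed over a different aggregate than these two rows.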

Experiment Figures

Visual examples of the 4 oversharing types in the taxonomy using a shopping scenario

Main Takeaways

Oversharing is pervasive across all tested models (o3, o4-mini, gpt-4o) and frameworks (Browser-Use, AutoGen), not limited to weaker models
Behavioral oversharing (clicks, scrolls) dominates Content oversharing by approximately 5x, meaning text-only filters miss the majority of privacy leaks
Privacy and utility are aligned: filtering out task-irrelevant information before the agent acts improves task success rates by up to 17.9 percentage points, likely by reducing distraction
Prompt-level mitigations (telling the agent not to share) are often ineffective or can even worsen oversharing behavior

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and Agentic frameworks
Contextual Integrity (privacy framework)
Web automation (DOM, actions like click/scroll)

Key Terms

Contextual Integrity: A privacy framework that evaluates whether information flows respect context-specific norms (e.g., a doctor needs health info, a grocer does not)

Oversharing: The unintentional disclosure of task-irrelevant user information to external parties

Behavioral Oversharing: Leaking information through navigation actions like clicking specific filters or scrolling specific sections, rather than typing text

Implicit Oversharing: Disclosing information not verbatim, but in a way that allows a passive observer to infer sensitive attributes (e.g., browsing 'Single Mom' supplies implies 'Divorced')

Passive Observer: A third party (like a website operator) that monitors agent actions without interfering or injecting malicious prompts

LLM-Judge: Using a separate LLM to evaluate the outputs or actions of the main agent

S_irrelevant: The set of user attributes available in the context that are NOT necessary for the specific task at hand
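
Read as a set operation, S_irrelevant is the full attribute set minus the attributes the task needs. The sketch below illustrates that split and the pre-filtering idea from the takeaways. It assumes the set of needed attribute names is supplied by hand; how the paper decides relevance is not described in this summary, and `split_attributes` and `build_prompt` are hypothetical helpers.

```python
from typing import Dict, Set, Tuple


def split_attributes(
    attributes: Dict[str, str], needed: Set[str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Partition user attributes into task-relevant and task-irrelevant parts."""
    relevant = {k: v for k, v in attributes.items() if k in needed}
    irrelevant = {k: v for k, v in attributes.items() if k not in needed}
    return relevant, irrelevant


def build_prompt(request: str, relevant: Dict[str, str]) -> str:
    """Build an agent prompt that carries only the task-relevant context."""
    context = "; ".join(f"{k}: {v}" for k, v in sorted(relevant.items()))
    return f"Context: {context}\nRequest: {request}"


# Illustrative persona; attribute names and values are made up.
attributes = {
    "health_condition": "monitors blood glucose",
    "marital_status": "divorced",
}
needed = {"health_condition"}  # assumed to be given for this task

s_relevant, s_irrelevant = split_attributes(attributes, needed)
prompt = build_prompt("find glucose test strips", s_relevant)
print(prompt)
```

Because the irrelevant attributes never reach the agent, they cannot appear in typed text or steer filter clicks, which is the intuition behind the reported alignment between privacy and task success.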