WARD: Provable RAG Dataset Inference via LLM Watermarks

📝 Paper Summary

RAG Privacy & Security Copyright Protection Membership Inference

WARD enables data owners to provably detect if their documents were used in a RAG system by embedding LLM watermarks that persist through the retrieval and generation pipeline.

Core Problem

Data owners currently have no way to prove unauthorized usage of their content in RAG systems, and existing methods (Membership Inference Attacks) are unreliable in realistic settings.

Why it matters:

Copyright holders need technical tools to audit RAG providers and enforce opt-out requests
Current baselines fail when RAG corpora contain redundant facts (multiple documents sharing information), a common real-world scenario
False accusations against model providers must be statistically controlled to ensure trust in auditing tools

Concrete Example: A RAG provider scrapes a news site. The owner wants to check if their articles are in the corpus. Existing methods get confused if the same news facts appear in other authorized documents. WARD succeeds by checking for a specific watermark pattern that only the owner's documents possess.

Key Novelty

Proactive RAG Dataset Inference via Watermarking

Instead of relying on post-hoc analysis of model outputs (like perplexity), the data owner proactively watermarks their documents before publication
The method (WARD) leverages the property that red-green watermarks propagate through RAG: if the RAG system retrieves a watermarked document, the generated answer retains statistical traces of the watermark
WARD aggregates the weak per-query signals from multiple queries into a single rigorous statistical test (p-value) for the entire dataset
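
To illustrate why aggregation matters, here is a small worked calculation. It assumes the standard red-green z statistic (g green tokens out of T scored tokens, green ratio γ = 0.25 as listed under Key Hyperparameters); the token counts are hypothetical and the paper's exact test statistic may differ.

```latex
% Assumed form of the red-green z statistic
z = \frac{g - \gamma T}{\sqrt{T\,\gamma\,(1-\gamma)}}

% Hypothetical single response: T = 100 tokens, g = 32 green
z_{\text{single}} = \frac{32 - 25}{\sqrt{100 \cdot 0.25 \cdot 0.75}} \approx 1.62

% Hypothetical pool of 20 such responses: T = 2000 tokens, g = 640 green
z_{\text{pooled}} = \frac{640 - 500}{\sqrt{2000 \cdot 0.25 \cdot 0.75}} \approx 7.23
```

For a fixed green-token fraction, the pooled score grows with the square root of the number of tokens, so many individually inconclusive answers can still produce a very small dataset-level p-value.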

Architecture

Workflow of WARD: Data Owner watermarks data -> RAG Provider scrapes it -> Owner queries RAG -> Detects watermark in aggregated responses

Evaluation Highlights

Achieves 100% accuracy in detecting dataset usage across all tested models (Llama-3.1-70B, Claude 3 Haiku, GPT-3.5) and settings
Maintains 0 false positives even when RAG providers use defensive prompts designed to prevent data leakage
Outperforms state-of-the-art baselines (SIB, AAG) which fail near-completely in realistic settings with fact redundancy

Breakthrough Assessment

9/10

Establishes a new problem setting (RAG-DI), provides the first rigorous benchmark (FARAD), and proposes a solution that essentially solves the problem (100% accuracy, provable guarantees) where baselines fail.

⚙️ Technical Details

Problem Definition

Setting: Black-box RAG Dataset Inference (RAG-DI). A data owner with dataset D_do queries a RAG system M* with corpus D to determine whether D_do is a subset of D.

Inputs: Query access to RAG system M*, Data owner's dataset D_do

Outputs: Binary decision: 1 (IN) if dataset is used, 0 (OUT) otherwise, with a statistical p-value

Pipeline Flow

Data Watermarking (Owner paraphrases documents with watermark)
Query Generation (Owner generates questions for each document)
Black-box Querying (Owner sends questions to RAG system)
Response Aggregation (Owner computes joint p-value across all responses)
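
The last three steps can be sketched as a single audit loop. This is a minimal sketch, not the authors' implementation: `run_audit` and all four callables passed into it are hypothetical names, and the watermarking step is assumed to have happened before the documents were published. The default `alpha` is the significance threshold listed in this summary.

```python
from typing import Callable, Iterable, Tuple


def run_audit(
    documents: Iterable[str],
    make_question: Callable[[str], str],        # hypothetical: builds one question per document
    query_rag: Callable[[str], str],            # hypothetical: black-box call to the audited RAG system
    count_green: Callable[[str], Tuple[int, int]],  # hypothetical: (green tokens, scored tokens)
    p_value: Callable[[int, int], float],       # hypothetical: watermark test on pooled counts
    alpha: float = 3e-5,
) -> Tuple[float, bool]:
    """Query the RAG system once per document and test the pooled responses."""
    green_total, token_total = 0, 0
    for doc in documents:
        question = make_question(doc)      # Query Generation
        response = query_rag(question)     # Black-box Querying
        green, scored = count_green(response)
        green_total += green               # Response Aggregation: pool counts
        token_total += scored
    p = p_value(green_total, token_total)
    # Decision: 1 (IN) if the joint p-value falls below the threshold.
    return p, p < alpha
```

Passing the components in as callables keeps the sketch independent of any particular tokenizer, RAG API, or watermark detector.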

System Modules

Watermarker

Embed watermark into the owner's documents before they are potentially scraped

Model or implementation: Llama-3-8B-Instruct (used for paraphrasing)

RAG System

Retrieves documents and generates answers (the system being audited)

Model or implementation: Various (GPT-3.5, Claude 3 Haiku, Llama-3.1-70B)

Detector

Calculates p-value for the presence of the watermark in the aggregated responses

Model or implementation: Statistical Test (Z-test)
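
A one-sided Z-test of this kind could look like the sketch below. It assumes the common red-green formulation, in which each scored token is green with probability gamma under the null hypothesis (no watermark); the function names are illustrative, and details such as how the paper deduplicates or pools tokens are not covered here.

```python
import math


def watermark_z_score(green_count: int, total_tokens: int, gamma: float = 0.25) -> float:
    """z-score of the green-token count against the no-watermark expectation."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def watermark_p_value(green_count: int, total_tokens: int, gamma: float = 0.25) -> float:
    """One-sided p-value: probability of a z-score at least this large by chance."""
    z = watermark_z_score(green_count, total_tokens, gamma)
    # Upper tail of the standard normal: P(Z >= z) = 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

To test a whole set of responses jointly, one option is to sum the green and total token counts over all responses and call `watermark_p_value` once on the pooled counts, then compare the result to the significance threshold.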

Novel Architectural Elements

Application of LLM watermarking as a provenance tracking mechanism for RAG corpora rather than model training
Aggregated p-value testing across multiple RAG queries to detect diluted signals

Modeling

Base Model: Llama-3-8B-Instruct (for watermarking/paraphrasing)

Training Method: Inference-time watermarking (logit bias)

Trainable Parameters: None (inference-time intervention)

Key Hyperparameters:

watermark_context_width_h: 2
watermark_strength_delta: 3.5
green_token_ratio_gamma: 0.25
significance_threshold_alpha: 3e-5
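
To show how these hyperparameters interact, here is a simplified sketch of a red-green logit bias. It is an assumption-laden illustration rather than the paper's code: the SHA-256 seeding, the `secret_key` parameter, and both function names are hypothetical, and a real implementation would operate on model logit tensors instead of Python lists.

```python
import hashlib
import random
from typing import List, Sequence, Set


def green_token_ids(context: Sequence[int], vocab_size: int, gamma: float = 0.25,
                    h: int = 2, secret_key: str = "owner-key") -> Set[int]:
    """Pick a pseudo-random green set, seeded by the last h context tokens."""
    window = ",".join(str(t) for t in context[-h:])
    digest = hashlib.sha256((secret_key + ":" + window).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # gamma controls what fraction of the vocabulary is green.
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def bias_logits(logits: List[float], context: Sequence[int], gamma: float = 0.25,
                delta: float = 3.5, h: int = 2, secret_key: str = "owner-key") -> List[float]:
    """Add delta to the logits of green tokens; red tokens are left unchanged."""
    green = green_token_ids(context, len(logits), gamma, h, secret_key)
    return [x + delta if i in green else x for i, x in enumerate(logits)]
```

Because the green set depends only on the key and the preceding h tokens, the owner can recompute it later on any text without access to the model that generated it.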

Compute: Not reported in the paper

Comparison to Prior Work

vs. SIB: WARD is proactive (requires watermarking) but robust to defenses and fact redundancy, whereas SIB fails in Hard settings
vs. AAG: WARD uses natural queries rather than suspicious 'system' queries, making it harder to detect/block
vs. Training Data Extraction [not cited in paper]: WARD targets retrieval corpora (inference time) rather than model weights (training time), allowing detection without model access
vs. Radioactive Data [not cited in paper]: WARD uses token-level watermarks for text rather than feature-space marking for images/classifiers

Limitations

Requires the data owner to proactively watermark data before it is scraped; cannot be applied retroactively to already-published clean data
Relies on the RAG system preserving enough original text tokens (though shown to be robust to paraphrasing defenses)
Assumes the watermarked text is not heavily modified or cleansed by the RAG provider before indexing

Reproducibility

Code: https://github.com/eth-sri/ward

The code and the FARAD dataset are publicly available at the repository above. Detailed prompts are given in Appendix F of the paper.

📊 Experiments & Results

Evaluation Setup

Detecting if a subset of documents is present in a RAG corpus using black-box queries

Benchmarks:

FARAD-Easy (Dataset Inference without fact redundancy) [New]
FARAD-Hard (Dataset Inference with fact redundancy: multiple articles sharing the same facts) [New]

Metrics:

Accuracy (Dataset-level detection)
True Positive Rate
False Positive Rate
p-value
Statistical methodology: Z-test for watermark detection; p-values reported
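
The dataset-level metrics above can be computed from binary audit decisions as in the following sketch. The function name and the input encoding (1 for IN, 0 for OUT, one entry per audited dataset) are assumptions made for illustration.

```python
from typing import Dict, Sequence


def detection_metrics(decisions: Sequence[int], labels: Sequence[int]) -> Dict[str, float]:
    """Accuracy, true positive rate, and false positive rate for IN/OUT decisions."""
    if len(decisions) != len(labels) or not labels:
        raise ValueError("decisions and labels must be non-empty and equally long")
    pairs = list(zip(decisions, labels))
    tp = sum(1 for d, y in pairs if d == 1 and y == 1)
    fp = sum(1 for d, y in pairs if d == 1 and y == 0)
    tn = sum(1 for d, y in pairs if d == 0 and y == 0)
    fn = sum(1 for d, y in pairs if d == 0 and y == 1)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Rates are undefined (NaN) when the corresponding class is absent.
        "tpr": tp / (tp + fn) if (tp + fn) else float("nan"),
        "fpr": fp / (fp + tn) if (fp + tn) else float("nan"),
    }
```

In an auditing context the false positive rate is the quantity tied to wrongful accusations, which is why it is reported separately from accuracy.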

Key Results

Main results on FARAD-Hard (realistic setting with redundant facts) show WARD achieves perfect detection while baselines fail.

Benchmark	Metric	Baseline	This Paper	Δ
FARAD-Hard (Llama-3.1-70B, Naive Prompt)	Accuracy	See Figure 4 (Visual implies ~50-60%)	1.00	Large positive margin
FARAD-Hard (GPT-3.5, Naive Prompt)	Accuracy	See Figure 4 (Visual implies failure)	1.00	Large positive margin
FARAD-Hard (Claude 3 Haiku, Naive Prompt)	Accuracy	See Figure 4 (Visual implies failure)	1.00	Large positive margin

Ablation on watermark strength (delta) shows a sweet spot for detection without destroying text quality.

Benchmark	Metric	Baseline	This Paper	Δ
FARAD-Hard	Clean Grade (Text Quality)	0.903	0.898	-0.005
FARAD-Hard	Paraphrase Quality (P-SP)	1.000	0.933	-0.067

Experiment Figures

Grid of results (Heatmap-style) comparing WARD vs baselines (FACTS, AAG, SIB) across multiple models and settings

Evolution of p-values as the number of queries increases

Main Takeaways

WARD achieves 100% accuracy in detecting dataset usage across all settings (Easy/Hard, Naive/Defensive prompts), whereas baselines collapse in the realistic Hard setting.
LLM Watermarks propagate effectively through the RAG pipeline: retrieval and generation do not destroy the signal.
Existing datasets (Enron, HealthcareMagic) are insufficient for RAG-DI evaluation because they lack fact redundancy; FARAD fills this gap.
WARD provides rigorous statistical guarantees (p-values < 1e-10) with zero false positives, satisfying the requirement for legally/ethically sound auditing tools.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
LLM Watermarking (Red-Green schemes)
Hypothesis testing (p-values, z-scores)

Key Terms

RAG-DI: RAG Dataset Inference—the problem of detecting whether a specific dataset is included in a RAG system's knowledge base

LLM Watermarking: Embedding a statistical signal into text generated by an LLM (typically by biasing vocabulary choice) to allow later detection

Red-Green Watermark: A specific watermarking scheme where the vocabulary is split into 'green' (promoted) and 'red' (demoted) tokens based on the preceding context

MIA: Membership Inference Attack—determining if a specific data point was used to train a model or is present in a database

Fact Redundancy: A realistic scenario where the same factual information appears in multiple documents in a corpus, complicating attribution

FARAD: Fact-Redundant Article Dataset—a new benchmark introduced in this paper designed to evaluate RAG-DI under realistic conditions of information overlap

z-score: The number of standard deviations by which an observed statistic differs from its expected value under the null hypothesis; used here to measure how far the green-token count in a text exceeds chance level

p-value: The probability of observing results at least as extreme as the observed results assuming the null hypothesis (no watermark) is true

system prompt defense: Instructions given to an LLM (e.g., 'do not reveal sources') to prevent it from leaking information about its context