Language Models May Verbatim Complete Text They Were Not Explicitly Trained On

📝 Paper Summary

Training Data Membership | Memorization in LLMs | Adversarial Data Construction

LLMs can verbatim complete text sequences even when those sequences are removed from training data based on n-gram overlap, revealing that current membership definitions are easily gamed.

Core Problem

Standard definitions of training data membership rely on n-gram overlap, assuming that if a model completes a text verbatim, that text must be in the training set.

Why it matters:

Privacy and copyright auditing rely on checking if a model can reproduce text to determine if it was trained on it
Unlearning methods often assume that removing specific n-grams is sufficient to prevent output generation, which this paper disproves
Malicious actors could poison models or evade contamination detection by transforming text so it has no n-gram overlap but is still reconstructible

Concrete Example: A model retrained from scratch after removing all 50-grams of a specific text (e.g., a famous quote) can still complete it verbatim, because it learns from shorter overlaps or from near-duplicates that the filter misses (e.g., '1477 by Topic' differs from '1477 by topic' only in casing).

Key Novelty

Gaming n-gram Membership via Auxiliary Information

Demonstrates that removing exact n-gram matches from training data does not stop models from generating the removed text (lingering sequences)
Shows that adversarial datasets (using token dropouts or casing flips) can force a model to learn a target text without containing any valid n-grams of that text

Architecture

Overview of the two main experimental pipelines: (Left) Retraining with filters to find lingering sequences, and (Right) Adversarial fine-tuning to force completion without membership.

Evaluation Highlights

Retrained 1.6B model verbatim completes ~40% of sequences that were explicitly removed using exact n-gram filtering
Even with aggressive filtering (removing any text sharing a 5-gram), ~1% of removed sequences remain verbatim completable
Adversarial token dropout (50% drop rate) allows a 0.5B model to learn and verbatim complete a target text despite zero n-gram overlap in training

Breakthrough Assessment

8/10

Fundamentally challenges the standard operational definition of 'membership' used in privacy and copyright, showing it is insufficient and easily circumvented.

⚙️ Technical Details

Problem Definition

Setting: Determine if a target sequence x is a member of dataset D and if model M can complete x given a prefix p

Inputs: Training dataset D, target sequence x = [p || s]

Outputs: Boolean membership status (based on n-gram overlap) and completion status (M(p) == s)
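
The two outputs above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: sequences are lists of token ids, membership is read as "shares at least one n-gram with some training document", and `generate` is a hypothetical callable standing in for greedy decoding from model M.

```python
from typing import Callable, List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[int], n: int) -> Set[Tuple[int, ...]]:
    """Return every contiguous n-gram of a token sequence as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_ngram_member(x: Sequence[int], dataset: List[Sequence[int]], n: int) -> bool:
    """Assumed definition: x is a member if it shares any n-gram with any document."""
    target = ngrams(x, n)
    return any(target & ngrams(doc, n) for doc in dataset)


def completes_verbatim(
    generate: Callable[[Sequence[int], int], Sequence[int]],
    prefix: Sequence[int],
    suffix: Sequence[int],
) -> bool:
    """Completion status: does the model's continuation of p equal s exactly?"""
    # `generate(prefix, k)` is hypothetical: it should return k continuation tokens.
    return list(generate(prefix, len(suffix))) == list(suffix)
```

The paper's central observation corresponds to the case where `is_ngram_member` is false while `completes_verbatim` is true.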

Pipeline Flow

Phase 1: Pre-train Base Model (M_base) on D_base
Phase 2: Identify Memorized Sequences (D_mem) from M_base
Phase 3: Filter D_base to create D_filter by removing n-grams of D_mem
Phase 4: Retrain Model (M_filter) from scratch on D_filter
Phase 5: Adversarial Fine-tuning (create D_ft from target x using transformations f(x) with no n-gram overlap)
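
Phases 1 to 4 can be read as a single counterfactual experiment. The skeleton below is only a sketch of that control flow: `train`, `extract_memorized`, `filter_corpus`, and `completes` are hypothetical callables supplied by the caller, and none of them correspond to functions named in the paper.

```python
def lingering_experiment(d_base, train, extract_memorized, filter_corpus, completes):
    """Run the retraining pipeline and return (D_mem, lingering sequences)."""
    m_base = train(d_base)                      # Phase 1: pre-train M_base on D_base
    d_mem = extract_memorized(m_base, d_base)   # Phase 2: sequences M_base completes
    d_filter = filter_corpus(d_base, d_mem)     # Phase 3: remove n-grams of D_mem
    m_filter = train(d_filter)                  # Phase 4: retrain from scratch
    # A sequence "lingers" if the retrained model still completes it.
    lingering = [x for x in d_mem if completes(m_filter, x)]
    return d_mem, lingering
```

Phase 5 (adversarial fine-tuning) is a separate experiment and is not part of this loop.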

System Modules

Base Pre-trainer

Train initial model to identify naturally memorized sequences

Model or implementation: GPT-2 architecture (350M to 2.8B parameters)

Sequence Filter

Remove sequences from training data based on n-gram overlap

Model or implementation: Sliding window n-gram matcher
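
A sliding-window n-gram matcher of this kind might look as follows. This is a sketch under assumptions: documents are token-id lists, every token covered by a banned n-gram is dropped, and the document is split at the removed spans so that no new n-grams are created by joining the pieces. Whether the paper removes spans or whole documents is not stated in this summary.

```python
from typing import List, Sequence, Set, Tuple


def banned_ngrams(sequences: List[Sequence[int]], n: int) -> Set[Tuple[int, ...]]:
    """Collect every n-gram of the sequences that should be removed (D_mem)."""
    return {tuple(s[i:i + n]) for s in sequences for i in range(len(s) - n + 1)}


def filter_document(doc: Sequence[int], banned: Set[Tuple[int, ...]], n: int) -> List[List[int]]:
    """Slide a window of size n over doc and cut out every banned n-gram."""
    covered = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if tuple(doc[i:i + n]) in banned:
            for j in range(i, i + n):
                covered[j] = True
    # Split at removed spans instead of concatenating the survivors.
    pieces, current = [], []
    for token, hit in zip(doc, covered):
        if hit:
            if current:
                pieces.append(current)
                current = []
        else:
            current.append(token)
    if current:
        pieces.append(current)
    return pieces
```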

Adversarial Transformation

Transform target text x into training samples D_ft that have zero n-gram overlap with x

Model or implementation: Heuristic algorithms (Chunking, Token Dropout, Casing Flips)
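
The three heuristics can be illustrated with simplified stand-ins. The functions below are assumptions for illustration only: they operate on plain token lists or strings rather than a real tokenizer, and the rejection step in `build_dropout_set` is one simple way to guarantee zero n-gram overlap, not necessarily the paper's procedure.

```python
import random
from typing import List, Sequence


def token_dropout(tokens: Sequence, drop_rate: float, rng: random.Random) -> List:
    """Keep each token independently with probability 1 - drop_rate."""
    return [t for t in tokens if rng.random() >= drop_rate]


def casing_flip(text: str, flip_rate: float, rng: random.Random) -> str:
    """Swap the case of each letter with probability flip_rate."""
    return "".join(
        c.swapcase() if c.isalpha() and rng.random() < flip_rate else c for c in text
    )


def chunk(tokens: Sequence, size: int) -> List[List]:
    """Split into pieces of at most `size` tokens; size < n rules out any n-gram."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def shares_ngram(sample: Sequence, target: Sequence, n: int) -> bool:
    """True if sample and target have at least one n-gram in common."""
    grams = lambda s: {tuple(s[i:i + n]) for i in range(len(s) - n + 1)}
    return bool(grams(sample) & grams(target))


def build_dropout_set(target, n, drop_rate, num_samples, seed=0, max_attempts=10_000):
    """Sample dropout variants of target, keeping only those with no shared n-gram."""
    rng = random.Random(seed)
    samples = []
    for _ in range(max_attempts):
        if len(samples) == num_samples:
            break
        candidate = token_dropout(target, drop_rate, rng)
        if not shares_ngram(candidate, target, n):
            samples.append(candidate)
    return samples
```

Each individual sample is a corrupted view of x, but across many samples every token of x appears in context, which is the auxiliary information the model can use to reconstruct the target.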

Novel Architectural Elements

Counterfactual retraining pipeline: explicitly retraining models from scratch on datasets with guaranteed n-gram removal to test definition boundaries
Adversarial data construction pipeline: systematic generation of training data (using token dropouts/casing flips) that forces verbatim completion without valid n-gram membership

Modeling

Base Model: GPT-2 (350M, 774M, 1.6B, 2.8B variants)

Training Method: Standard Next-Token Prediction (Pre-training and Fine-tuning)

Objective Functions:

Purpose: Maximize probability of the next token in the sequence.

Formally: Minimize Cross-Entropy Loss
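
For reference, the standard next-token cross-entropy objective can be written as below. The notation is an assumption introduced here (the summary does not give a formula): $x_1, \dots, x_T$ are the tokens of a training sequence and $p_\theta$ is the model's predictive distribution with parameters $\theta$.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Minimizing $\mathcal{L}(\theta)$ is equivalent to maximizing the log-probability of each next token given its prefix, which matches the stated purpose.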

Training Data:

Pre-training: FineWeb-Edu (33.6B tokens)
Filtering: Removed extracted sequences (length k=50 or 100)
Adversarial Fine-tuning: 2,000 examples constructed from a single target text (approx. 1,000 characters)

Key Hyperparameters:

fine_tuning_batch_size: 32
fine_tuning_learning_rate: 1e-5

Compute: Not reported in the paper

Comparison to Prior Work

vs. MinHash/Suffix Array: Proves these methods fail to prevent completion if 'auxiliary information' (like shorter overlaps or patterns) remains
vs. Goldfish Loss: Goldfish loss masks tokens in loss during training; this paper's token dropout modifies the input data itself to evade membership detection while still enabling learning
vs. Dataset Inference (Maini et al.) [not cited in paper]: Focuses on generative completion capability rather than statistical membership scores

Limitations

Evaluation limited to exact verbatim completion; does not fully explore semantic understanding
Adversarial experiments focused on specific text types (news, code, blogs); generalization to all text types not guaranteed
Computational cost of retraining from scratch limits the scale of experiments (max 2.8B model)
Proxy metrics for 'generalization' vs 'memorization' (e.g., GPT-2-XL agreement) are imperfect heuristics

Reproducibility

Available: Uses the public FineWeb-Edu dataset and standard GPT-2/Gemma/Qwen architectures. Uses LLM.c for training.
Missing: Specific seed values for random sampling of datasets.
Code URL: Not provided in the paper text.

📊 Experiments & Results

Evaluation Setup

Pre-train models, extract memorized sequences, filter them from data, retrain, and check if sequences are still completed. Separately, fine-tune on adversarial data.

Benchmarks:

Lingering Completion Test (Verbatim text completion) [New]
Adversarial Completion Test (Recovering target text from non-overlapping noisy examples) [New]

Metrics:

Lingering fraction (%)
Edit similarity (%)
Verbatim completion rate
Statistical methodology: Not explicitly reported in the paper
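
The metrics above can be computed along these lines. This is a sketch: edit similarity is taken here to be one minus the Levenshtein distance normalized by the longer string's length, which is a common convention but an assumption about the paper's exact normalization.

```python
from typing import List, Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def edit_similarity(a: Sequence, b: Sequence) -> float:
    """Similarity in percent; 100 means the two sequences are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def lingering_fraction(still_completed: List[bool]) -> float:
    """Percent of removed sequences that the retrained model still completes."""
    return 100.0 * sum(still_completed) / len(still_completed)
```

Verbatim completion corresponds to an edit similarity of exactly 100 between the generated and the true suffix.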

Key Results

Lingering sequences: models can still complete sequences even after those sequences are removed from training data.
Benchmark	Metric	Baseline	This Paper	Δ
FineWeb-Edu Retraining (exact n-gram filter)	Lingering fraction (%)	0	40	+40
FineWeb-Edu Retraining (5-gram filter)	Lingering fraction (%)	0	1	+1

Adversarial fine-tuning: models can learn to complete text from data with zero n-gram overlap.
Benchmark	Metric	Baseline	This Paper	Δ
Target: Willow (blog)	Edit similarity (%)	0	90	+90
Target: Karpathy (tweet)	Completion Success	0	100	+100

Experiment Figures

Bar chart of Lingering Fraction (%) across different model sizes and filter strengths (n=5, 10, 20, 50)

Heatmaps/Curves of completion success for adversarial methods (Chunking, Dropouts, Casing Flips) across different parameters.

Main Takeaways

Lingering sequences persist because they are either 'de facto' members (via shorter m-gram overlaps not captured by n-gram filters) or low-entropy patterns (counting, templates)
Stronger filtering (n=5 vs n=50) shifts lingering sequences from verbatim memorization of content to generalizable patterns
Token dropout is a highly effective adversarial strategy: training on sequences with 50% dropped tokens allows perfect reconstruction while bypassing n-gram checks
Adversarial completion capability scales with model size; larger models are better at 'denoising' the non-overlapping training data into the target sequence
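
The "de facto member" case in the first takeaway can be made concrete with a toy check. The sketch below is illustrative only: it measures the longest contiguous run of tokens two sequences share, so a near-duplicate whose longest shared run is n - 1 passes an n-gram filter while still carrying most of the target.

```python
from typing import Sequence


def longest_shared_run(a: Sequence, b: Sequence) -> int:
    """Length of the longest contiguous token run that appears in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                # Extend the run that ended at the previous position in both sequences.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


# Toy target and a near-duplicate with every fifth token replaced by a placeholder.
target = list(range(20))
near_duplicate = [t if (i + 1) % 5 else -1 for i, t in enumerate(target)]
print(longest_shared_run(target, near_duplicate))
```

With these toy inputs the longest shared run is 4 tokens, so a 5-gram filter would keep the near-duplicate even though 16 of the 20 target tokens survive in order.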

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model pre-training and fine-tuning
Familiarity with n-gram tokenization and matching
Basic concepts of membership inference and data extraction

Key Terms

n-gram: A contiguous sequence of n items (tokens) from a given sample of text or speech

verbatim completion: When a language model generates the exact suffix of a text sequence when prompted with its prefix

lingering sequences: Text sequences that a model can still complete verbatim even after they have been explicitly filtered out of the training dataset

membership inference: The task of determining whether a specific data point was used to train a machine learning model

BPE: Byte-Pair Encoding, a tokenization method that iteratively merges the most frequent pair of bytes (or characters) into a single new token

token dropout: An adversarial technique where random tokens in a sequence are masked/dropped to prevent n-gram overlap while retaining semantic information

MinHash: An algorithm used to estimate the similarity of two sets (like documents) quickly, often used for approximate deduplication

suffix array: A data structure that enables efficient lookup of all substrings in a text corpus, used for exact deduplication

output suppression: A goal of machine unlearning in which the model is prevented from generating specific sequences (e.g., harmful content)