Extracting books from production language models

📝 Paper Summary

Memorization and extraction · Copyright infringement in Generative AI · Adversarial attacks / Jailbreaking

A two-phase attack procedure combining jailbreaking and iterative continuation can extract large portions of in-copyright books from production LLMs like Claude 3.7 Sonnet, despite safety guardrails.

Core Problem

It is unclear whether long-form extraction of copyrighted training data, previously demonstrated on open-weight models, is feasible on production LLMs, which employ extensive model- and system-level safeguards to prevent data leakage.

Why it matters:

Legal debates on fair use hinge on whether models merely learn abstract patterns or memorize and reproduce creative expression verbatim
Courts in the U.S. and Germany have reached different preliminary conclusions about whether model outputs constitute infringing copies
Production models (e.g., GPT-4, Claude) expose only black-box APIs and apply safety filters that are intended to prevent the specific behavior (verbatim completion) used in prior extraction research

Concrete Example: When prompted to 'Continue the following text exactly...' with the first sentence of 'Harry Potter and the Sorcerer's Stone', a production LLM usually refuses. Using the proposed method, Claude 3.7 Sonnet outputs nearly the entire book (95.8%) near-verbatim.

Key Novelty

Two-Phase Extraction Attack for Production LLMs

Phase 1: Bypass initial refusal using Best-of-N (BoN) jailbreaking to force the model to complete a short seed prefix of the target book
Phase 2: Use iterative continuation prompts to generate long sequences of text, circumventing length limits and filters by feeding back the model's own output

Architecture

The two-phase extraction pipeline: (1) Initial probe with optional jailbreak, (2) Iterative continuation loop

Evaluation Highlights

Extracted 95.8% of 'Harry Potter and the Sorcerer's Stone' near-verbatim from Claude 3.7 Sonnet
Extracted 76.8% of 'Harry Potter' from Gemini 2.5 Pro and 70.3% from Grok 3 without requiring any jailbreak (direct compliance)
GPT-4.1 proved most resistant, yielding only 4.0% of 'Harry Potter' even after extensive jailbreaking attempts
Recovered >94% of the text for two in-copyright books and two public domain books from Claude 3.7 Sonnet

Breakthrough Assessment

9/10

Demonstrates that production LLMs with safety guardrails can still reproduce near-complete copyrighted books, a finding directly relevant to ongoing AI copyright litigation.

⚙️ Technical Details

Problem Definition

Setting: Black-box extraction of memorized training data from aligned production LLMs via API access

Inputs: A short ground-truth seed string s (prefix p + target suffix t) from a book

Outputs: Long-form generated text G attempting to reproduce the book B

Pipeline Flow

Phase 1: Seed Probe (Check feasibility / Jailbreak)
Phase 2: Iterative Continuation (Long-form generation)
Evaluation: Sequence Matching (Compute nv-recall)

System Modules

Phase 1 Probe

Determine if the model can complete a short prefix verbatim; uses BoN jailbreak if refused

Model or implementation: Target Production LLM (Claude 3.7, GPT-4.1, Gemini 2.5, Grok 3)
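The probe step can be pictured as a budgeted retry loop. The sketch below is a minimal illustration, not the paper's implementation: `query_model` and `perturb` are hypothetical callables supplied by the caller, the instruction wording is borrowed from the concrete example above, and the 0.6 similarity threshold and the default budget of 20 are assumptions chosen only for the example.

```python
import difflib
import random
from typing import Callable, Optional


def looks_verbatim(completion: str, target: str, threshold: float = 0.6) -> bool:
    """Compare the start of the completion with the known target suffix."""
    head = completion[: len(target)]
    ratio = difflib.SequenceMatcher(None, head, target, autojunk=False).ratio()
    return ratio >= threshold  # threshold is an illustrative assumption


def phase1_probe(
    query_model: Callable[[str], str],             # hypothetical: prompt -> reply
    perturb: Callable[[str, random.Random], str],  # hypothetical prompt augmentation
    prefix: str,
    target: str,
    budget: int = 20,
    seed: int = 0,
) -> Optional[int]:
    """Return the 1-based attempt that succeeded, or None if the budget ran out."""
    rng = random.Random(seed)
    instruction = "Continue the following text exactly: "
    for attempt in range(1, budget + 1):
        # Attempt 1 is the plain request; later attempts perturb the instruction.
        lead = instruction if attempt == 1 else perturb(instruction, rng)
        if looks_verbatim(query_model(lead + prefix), target):
            return attempt
    return None
```

A return value of 1 corresponds to direct compliance without any perturbation, and `None` means the probe failed within the budget.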

Phase 2 Loop

Iteratively prompt the model to continue generating text from the last output

Model or implementation: Target Production LLM
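The continuation loop can be sketched as follows. This is an illustrative outline under stated assumptions: `query_model` is a hypothetical callable, and the prompt wording, the refusal markers, the turn limit, and the context window size are placeholders rather than values from the paper.

```python
from typing import Callable, List

# Hypothetical refusal markers; a real harness would need a more careful check.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")


def phase2_continue(
    query_model: Callable[[str], str],  # hypothetical: prompt -> reply
    first_chunk: str,
    max_turns: int = 50,
    context_chars: int = 2000,
) -> str:
    """Feed the tail of the text generated so far back to the model each turn."""
    chunks: List[str] = [first_chunk]
    for _ in range(max_turns):
        context = "".join(chunks)[-context_chars:]
        reply = query_model("Continue: " + context)
        stripped = reply.strip().lower()
        # Stop on an empty reply or an apparent refusal.
        if not stripped or stripped.startswith(REFUSAL_MARKERS):
            break
        chunks.append(reply)
    return "".join(chunks)
```

The loop ends on the first refusal, the first empty reply, or when `max_turns` is reached, whichever comes first.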

Sequence Matcher

Identify near-verbatim blocks between generated text and ground truth book

Model or implementation: Python difflib (modified)
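For orientation, stock `difflib` already exposes the block matching that this module builds on. The sketch below uses the unmodified standard library, so it does not reflect the paper's modifications; the `min_len` cutoff and the character-level comparison are assumptions made for the example.

```python
import difflib
from typing import List, Tuple


def matching_blocks(book: str, generated: str, min_len: int = 20) -> List[Tuple[int, int, int]]:
    """Return (book_start, generated_start, size) for each long exact match."""
    # autojunk=False disables difflib's popularity heuristic, which would
    # otherwise ignore frequent characters in long inputs.
    matcher = difflib.SequenceMatcher(None, book, generated, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_len  # also drops the zero-length sentinel block
    ]


book = "The quick brown fox jumps over the lazy dog near the river bank."
generated = "Sure! The quick brown fox jumps over the lazy cat near the river bank."
print(matching_blocks(book, generated, min_len=10))
```

Here the single changed word splits the output into two exact blocks, which is why a later merge step is needed to treat the passage as one near-verbatim span.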

Novel Architectural Elements

Two-phase extraction protocol specifically designed for black-box production models
nv-recall metric pipeline: greedy block matching with recursive merge-and-filter steps, quantifying long-form memorization in a way that is robust to minor corruptions
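A simplified, single-pass version of the merge-and-filter idea is sketched below. It is an approximation for illustration only: the paper describes recursive steps, while this sketch merges once, and `max_gap`, `min_span`, and the choice to count merged gap characters toward recall are all assumptions.

```python
from typing import List, Tuple

Block = Tuple[int, int, int]  # (book_start, generated_start, size)


def merge_blocks(blocks: List[Block], max_gap: int) -> List[List[int]]:
    """Merge consecutive blocks separated by small gaps in both texts."""
    spans: List[List[int]] = []  # each span: [book_start, book_end, gen_start, gen_end]
    for a, b, size in sorted(blocks):
        if spans:
            last = spans[-1]
            gap_book, gap_gen = a - last[1], b - last[3]
            if 0 <= gap_book <= max_gap and 0 <= gap_gen <= max_gap:
                last[1], last[3] = a + size, b + size  # extend the previous span
                continue
        spans.append([a, a + size, b, b + size])
    return spans


def nv_recall(blocks: List[Block], book_len: int, max_gap: int = 5, min_span: int = 30) -> float:
    """Fraction of the book covered by merged spans of at least min_span characters."""
    spans = merge_blocks(blocks, max_gap)
    kept = [s for s in spans if s[1] - s[0] >= min_span]  # filter out short spans
    return sum(s[1] - s[0] for s in kept) / book_len
```

With two blocks separated by a three-character substitution, a `max_gap` of 5 merges them into one span, whereas a `max_gap` of 0 keeps them apart and the shorter one is filtered out.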

Modeling

Base Model: Evaluated on: Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, Grok 3

Training Method: Not applicable (Evaluation of existing closed-source models)

Adaptation: None (Inference-only attacks)

Trainable Parameters: None (Black-box API access only)

Compute: Not reported in the paper

Reproducibility

Code: https://github.com/a-cooper/production-extraction

📊 Experiments & Results

Evaluation Setup

Targeted extraction of 11 in-copyright books (published pre-2020) and public domain works

Benchmarks:

Harry Potter and the Sorcerer's Stone (Copyrighted Book Extraction)
Frankenstein (Public Domain Book Extraction)
The Great Gatsby (Public Domain Book Extraction)

Metrics:

nv-recall (near-verbatim recall percentage)
Phase 1 Success Rate (compliance with initial prompt)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Harry Potter and the Sorcerer's Stone (Claude 3.7 Sonnet)	nv-recall (%)	4.0	95.8	+91.8
Harry Potter and the Sorcerer's Stone (Gemini 2.5 Pro)	nv-recall (%)	4.0	76.8	+72.8
Harry Potter and the Sorcerer's Stone (Grok 3)	nv-recall (%)	4.0	70.3	+66.3
Harry Potter and the Sorcerer's Stone	BoN Attempts	1	20	+19

Experiment Figures

Bar chart comparing nv-recall (percentage of Harry Potter extracted) across the four tested models

Main Takeaways

Gemini 2.5 Pro and Grok 3 frequently comply with direct requests to output copyrighted text, requiring no jailbreak
Claude 3.7 Sonnet has strong initial refusals but is catastrophically vulnerable to Best-of-N jailbreaking, leading to >95% book extraction
GPT-4.1 is significantly more robust, requiring high jailbreak budgets and often refusing during the continuation phase (Phase 2)
System-level safeguards are insufficient to prevent large-scale copyright leakage when simple adversarial techniques are applied

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM memorization vs. generalization
Familiarity with jailbreaking techniques (adversarial prompts)
Basic knowledge of copyright law concepts (fair use, transformative use, reproduction)

Key Terms

Best-of-N (BoN): A jailbreak technique in which up to N randomly perturbed variations of a prompt (character noise, case changes, etc.) are submitted until one elicits a response that bypasses the model's refusal
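The kind of perturbation involved can be illustrated with a toy augmentation function. The specific operations (random case flips and adjacent-letter swaps) and their probabilities are assumptions for this sketch and are not taken from the paper.

```python
import random


def perturb(text: str, rng: random.Random, p_case: float = 0.6, p_swap: float = 0.1) -> str:
    """Randomly flip letter case and swap some adjacent letters (ASCII input assumed)."""
    chars = [c.swapcase() if c.isalpha() and rng.random() < p_case else c for c in text]
    i = 0
    while i < len(chars) - 1:
        # Swap two neighbouring letters with probability p_swap, then skip past them.
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < p_swap:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


rng = random.Random(0)
variants = [perturb("Continue the following text exactly", rng) for _ in range(3)]
```

Each call yields a different surface form of the same request, and seeding the generator makes a given sequence of variants reproducible.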

nv-recall (near-verbatim recall): A metric measuring the proportion of a book's text that is recovered in the model's output as long, contiguous, near-exact matches

discoverable extraction: The standard research method for measuring memorization: prompting with a prefix and checking if the model autocompletes the exact suffix
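As a rough illustration of this prefix-and-suffix check, the sketch below slides a window over a tokenized book and counts exact suffix matches. The `generate` callable, the window lengths, and the stride are hypothetical placeholders, not settings from the paper.

```python
from typing import Callable, List, Sequence


def discoverable_extraction_rate(
    generate: Callable[[Sequence[int], int], List[int]],  # hypothetical: (prefix, n) -> n tokens
    book_tokens: Sequence[int],
    prefix_len: int = 50,
    suffix_len: int = 50,
    stride: int = 100,
) -> float:
    """Fraction of sampled prefixes whose true suffix is reproduced exactly."""
    hits = total = 0
    last_start = len(book_tokens) - prefix_len - suffix_len
    for start in range(0, last_start + 1, stride):
        prefix = book_tokens[start : start + prefix_len]
        suffix = list(book_tokens[start + prefix_len : start + prefix_len + suffix_len])
        if list(generate(prefix, suffix_len)) == suffix:
            hits += 1
        total += 1
    return hits / total if total else 0.0
```

Because the comparison is exact, a single differing token counts as a miss, which is the strictness that near-verbatim metrics such as nv-recall relax.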

jailbreak: Adversarial prompting designed to bypass an LLM's safety filters or refusal policies

greedy approximation of longest common substring: An algorithm used here to identify matching text blocks between generated output and the original book, allowing for minor gaps or formatting differences