ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data

📝 Paper Summary

Language Model Post-Training
Privacy and Copyright in LLMs

ParaPO reduces unintentional verbatim memorization in language models by training them to prefer paraphrased versions of memorized content over the original text, while retaining the ability to quote when explicitly instructed.

Core Problem

Language models unintentionally regurgitate pre-training data verbatim, causing copyright and privacy issues, while existing unlearning methods fail to generalize beyond specific target domains.

Why it matters:

Unintentional reproduction diminishes creative capacity and introduces legal risks like copyright violation and plagiarism
Existing unlearning methods are effective only on the specific data they are trained to forget, failing to reduce regurgitation in general domains (e.g., creative writing)
Simply filtering pre-training data is insufficient because predicting exactly what will be regurgitated is difficult

Concrete Example: Asked to write a story, a model might output the exact opening of 'A Tale of Two Cities'. The same unintended reproduction of a copyright-protected work would raise infringement concerns. Unlearning might fix this specific book, but it won't stop the model from regurgitating a different web article or news piece.

Key Novelty

Paraphrase Preference Optimization (ParaPO)

Identifies verbatim memorized segments in the model and generates paraphrases of them using a stronger teacher model
Uses Direct Preference Optimization (DPO) to train the model to prefer the paraphrase (chosen) over the original memorized segment (rejected)
Introduces conditional system prompts (Copy-Yes/Copy-No) to allow the model to distinguish between 'unintentional regurgitation' (bad) and 'intentional quotation' (good)

Architecture

The ParaPO pipeline: identifying memorized segments, generating paraphrases, and applying DPO.

Evaluation Highlights

Reduces unintentional regurgitation of book snippets from 15.6% to 1.6% on Llama3.1-8B, far outperforming unlearning baselines
Achieves a 25.4% relative reduction in unintentional regurgitation during creative writing tasks compared to the base model (11-gram overlap falls from 17.3 to 12.9)
Preserves quotation utility: when instructed to allow copying, the model maintains a quotation recall of 27.5 (vs 28.0 baseline)

Breakthrough Assessment

8/10

Significant advance in generalizable regurgitation mitigation. Unlike unlearning (which is narrow), ParaPO teaches a general behavior of 'paraphrasing' that works across domains while maintaining utility.

⚙️ Technical Details

Problem Definition

Setting: Post-training alignment to minimize the probability of generating verbatim sequences from the pre-training corpus unless explicitly prompted

Inputs: Prompt requiring generation (e.g., partial book text or creative writing instruction)

Outputs: Generated text that conveys the meaning without verbatim copying (paraphrase)

Pipeline Flow

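Memorization Detection: Sample segments -> Prompt the target model with each segment's prefix -> Check whether the generated continuation reproduces the original (ROUGE-L > 0.5)
Paraphrase Generation: Use a stronger LLM (Llama3.1-70B-Instruct) to paraphrase the detected memorized segments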
Dataset Construction: Create (original, paraphrase) pairs
Preference Optimization: Fine-tune model using DPO to prefer paraphrases
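The four steps above can be sketched as a small data-construction routine. This is a minimal sketch under stated assumptions, not the authors' code: `generate`, `paraphrase`, and `is_memorized` are hypothetical callables standing in for the target model, the teacher model, and the overlap check, and tokens are plain lists of strings. How the DPO prompt is formed is not described in this summary, so the records carry only the chosen and rejected texts.

```python
from typing import Callable, Dict, List, Sequence

Tokens = List[str]


def build_preference_pairs(
    segments: Sequence[Tokens],
    generate: Callable[[Tokens, int], Tokens],        # hypothetical: target-model continuation
    paraphrase: Callable[[Tokens], Tokens],           # hypothetical: teacher-model paraphraser
    is_memorized: Callable[[Tokens, Tokens], bool],   # hypothetical: overlap check
    prefix_len: int = 64,
    continuation_len: int = 32,
) -> List[Dict[str, Tokens]]:
    """Keep only segments the target model reproduces, then pair each with a paraphrase."""
    pairs: List[Dict[str, Tokens]] = []
    for segment in segments:
        if len(segment) < prefix_len + continuation_len:
            continue  # too short to split into prefix and continuation
        prefix = segment[:prefix_len]
        reference = segment[prefix_len:prefix_len + continuation_len]
        generated = generate(prefix, continuation_len)
        if not is_memorized(generated, reference):
            continue  # the model did not reproduce this segment
        pairs.append({
            "chosen": paraphrase(segment),  # preferred: the paraphrase
            "rejected": segment,            # dispreferred: the memorized original
        })
    return pairs
```

The resulting list would then be handed to a DPO trainer; that step is left out here because it depends on the training framework.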

System Modules

Memorization Detector (Data Construction)

Identify segments the model has memorized verbatim

Model or implementation: Target Model (Llama-3.1-8B or Qwen2.5-7B)

Paraphraser (Data Construction)

Generate semantically equivalent but distinct text

Model or implementation: Llama3.1-70B-Instruct

Policy Optimizer

Update model weights to prefer paraphrases

Model or implementation: Target Model (Llama-3.1-8B or Qwen2.5-7B)

Novel Architectural Elements

Conditional Preference Optimization: Reversing 'chosen' and 'rejected' labels based on system prompt (Copy-Yes vs Copy-No) to enable controllable regurgitation
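The label reversal can be pictured as emitting two preference records per memorized segment. The sketch below is illustrative only: the system prompt strings are placeholders rather than the paper's wording, and the `system`/`chosen`/`rejected` record layout is an assumption modeled on common DPO dataset conventions.

```python
from typing import Dict, List

# Placeholder system prompts; the paper's exact wording is not given in this summary.
COPY_NO = "Do not reproduce existing text verbatim."
COPY_YES = "You may reproduce existing text verbatim."


def conditional_pairs(original: str, paraphrase: str) -> List[Dict[str, str]]:
    """Build one record per system prompt, swapping chosen and rejected."""
    return [
        # Copy-No: the paraphrase is preferred over the memorized original.
        {"system": COPY_NO, "chosen": paraphrase, "rejected": original},
        # Copy-Yes: the labels are reversed, so verbatim text is preferred.
        {"system": COPY_YES, "chosen": original, "rejected": paraphrase},
    ]
```

Training on both records gives the model a signal that ties verbatim output to the system prompt rather than suppressing it unconditionally.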

Modeling

Base Model: Llama3.1-8B and Qwen2.5-7B (Base and Instruction-Tuned variants)

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Train model to prefer paraphrase over verbatim text.

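Formally:

$$\mathcal{L}_{\mathrm{DPO}}(x, y_w, y_l) = -\log \sigma\!\left( \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right] \right)$$

Here y_w is the chosen response (the paraphrase), y_l is the rejected response (the original memorized segment), pi_theta is the model being trained, pi_ref is the frozen reference model, sigma is the logistic sigmoid, and beta is the scaling factor.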
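To make the objective above concrete, here is a minimal per-example sketch that takes sequence log-probabilities as plain floats. The default `beta=0.1` is an assumption (a commonly used value), since the paper's setting is not stated in this summary; a real training loop would compute the log-probabilities with the policy and reference models and average over a batch.

```python
import math


def dpo_loss(
    logp_w: float,      # log pi_theta(y_w | x), chosen (paraphrase)
    logp_l: float,      # log pi_theta(y_l | x), rejected (memorized original)
    ref_logp_w: float,  # log pi_ref(y_w | x)
    ref_logp_l: float,  # log pi_ref(y_l | x)
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (log-ratio of y_w - log-ratio of y_l))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree, the margin is zero and the loss is log 2; the loss shrinks as the policy raises the paraphrase's likelihood relative to the original.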

Training Data:

16,000 memorized segments selected from 1 million random documents from Pile-CC
Filtered by prompting the target model with the first 64 tokens of each segment and checking the next 32 generated tokens for overlap with the original continuation
Paraphrases generated by Llama3.1-70B-Instruct
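The filtering step can be sketched with a plain longest-common-subsequence ROUGE-L. This is an illustration under assumptions: it uses the F1 variant of ROUGE-L over token lists, whereas the summary only states a ROUGE-L > 0.5 threshold and does not say which variant or tokenizer the authors used. The function names are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: Sequence, reference: Sequence) -> float:
    """ROUGE-L F1 between two token sequences (assumed variant)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def is_memorized(
    generated: Sequence,
    segment: Sequence,
    prefix_len: int = 64,
    continuation_len: int = 32,
    threshold: float = 0.5,
) -> bool:
    """Compare the model's continuation with the 32 tokens that follow the 64-token prefix."""
    reference = segment[prefix_len:prefix_len + continuation_len]
    return rouge_l_f1(generated[:continuation_len], reference) > threshold
```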

Key Hyperparameters:

beta: DPO scaling factor; the value is not stated in the main text
segment_length: 96 tokens (64-token prompt prefix plus 32-token continuation)
prompt_prefix_length: 64 tokens

Compute: Not reported in the paper

Comparison to Prior Work

vs. Unlearning (GA/NPO): ParaPO generalizes to reduce regurgitation in unseen domains (e.g., creative writing), whereas unlearning only affects the specific targeted dataset
vs. Data Filtering: ParaPO is a post-training intervention, applicable when retraining is too expensive or data is already seen
vs. Supervised Fine-tuning on Paraphrases: ParaPO uses preference learning (DPO), which is shown to be more effective than simple SFT on paraphrases

Limitations

Slight degradation in general capabilities (math, reasoning) observed in base models
Requires identifying memorized segments first to construct training data
Evaluation relies on n-gram/ROUGE overlap, which may not capture all forms of semantic plagiarism

Reproducibility

Code: https://github.com/chentong0/ParaPO

The code is available on GitHub. The data construction process (Pile-CC sampling, paraphrase generation) is fully described, and the memorization detection criterion is specified (ROUGE-L > 0.5 on continuations).

📊 Experiments & Results

Evaluation Setup

Two settings: targeted evaluation (can the model recite specific books or web text?) and untargeted evaluation (does the model plagiarize during creative writing?)

Benchmarks:

Training Data Extraction Challenge: targeted regurgitation (web)
BookSum: targeted regurgitation (books)
Creative Writing (Story, Poem, Speech): untargeted regurgitation
MMLU, GSM8K, BBH, IFEval, AlpacaEval2: general utility and instruction following

Metrics:

Extraction Ratio (Targeted): % cases where ROUGE-L > 0.5 with ground truth
11-gram Overlap Ratio (Untargeted/Creative): Fraction of 11-grams appearing in The Pile
Quotation Recall: Ability to quote famous text when prompted
Statistical methodology: Not explicitly reported in the paper
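The two regurgitation metrics above can be sketched as follows. This is a simplified illustration: the in-memory set `corpus_ngrams` stands in for a corpus-scale index over The Pile, and the function and parameter names are hypothetical. The extraction ratio takes precomputed ROUGE-L scores, one per test case.

```python
from typing import Iterable, Sequence, Set, Tuple


def extraction_ratio(rouge_l_scores: Iterable[float], threshold: float = 0.5) -> float:
    """Percentage of cases whose ROUGE-L against the ground truth exceeds the threshold."""
    scores = list(rouge_l_scores)
    if not scores:
        return 0.0
    return 100.0 * sum(s > threshold for s in scores) / len(scores)


def ngram_overlap_ratio(
    tokens: Sequence[str],
    corpus_ngrams: Set[Tuple[str, ...]],  # stand-in for a large-corpus index
    n: int = 11,
) -> float:
    """Fraction of the output's n-grams that also occur in the reference corpus."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0  # output shorter than n tokens
    return sum(g in corpus_ngrams for g in grams) / len(grams)
```

Note that the threshold comparison is strict, so a score of exactly 0.5 does not count as an extraction in this sketch.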

Key Results

Comparison of ParaPO against unlearning baselines on Llama3.1-8B Base Model showing ParaPO's superior generalization.
Benchmark	Metric	Baseline	This Paper	Δ
BookSum (Targeted)	Extraction Ratio (%)	15.6	1.6	-14.0
Web Snippets (Targeted)	Extraction Ratio (%)	28.2	21.6	-6.6
Creative Writing	11-gram Overlap	17.3	12.9	-4.4

Results on Instruction-Tuned Models (Tulu3-8B) combining ParaPO with System Prompts.
Benchmark	Metric	Baseline	This Paper	Δ
Web Snippets	Extraction Ratio (%)	19.9	7.6	-12.3
Quotation Recall (Book/Poem)	Recall	0.4	27.5	+27.1
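As a quick consistency check, the Δ column and the 25.4% figure quoted under Evaluation Highlights can be recomputed from the table values. The 25.4% is a relative reduction of the creative-writing 11-gram overlap, while Δ is an absolute difference. The helper names below are illustrative.

```python
def absolute_delta(baseline: float, ours: float) -> float:
    """Absolute change, as reported in the Δ column."""
    return round(ours - baseline, 1)


def relative_reduction(baseline: float, ours: float) -> float:
    """Reduction as a percentage of the baseline value."""
    return round(100.0 * (baseline - ours) / baseline, 1)


# (baseline, this paper) pairs copied from the Llama3.1-8B rows of the table.
rows = {
    "BookSum": (15.6, 1.6),
    "Web Snippets": (28.2, 21.6),
    "Creative Writing": (17.3, 12.9),
}

for name, (baseline, ours) in rows.items():
    print(name, absolute_delta(baseline, ours), relative_reduction(baseline, ours))
```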

Experiment Figures

Radar charts comparing Regurgitation vs. Utility for Base, Unlearning, and ParaPO models.

Main Takeaways

Unlearning methods (GA, NPO) work well on the specific domain they are trained on (e.g., books) but fail to generalize to other domains (e.g., web text).
ParaPO effectively generalizes: training on a small set of paraphrases reduces regurgitation across diverse datasets (books, web, creative writing).
Training on actual memorized segments is crucial; applying ParaPO to random segments is significantly less effective.
Combining ParaPO with system prompts (Sys Mix) allows for a 'best of both worlds' scenario: low unintentional regurgitation but high recall when quoting is desired.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and next-token prediction
Familiarity with Reinforcement Learning from Human Feedback (RLHF) concepts
Knowledge of Direct Preference Optimization (DPO)

Key Terms

regurgitation: The phenomenon where a language model generates training data verbatim

DPO: Direct Preference Optimization, an algorithm that trains a language model directly on preference pairs (chosen vs. rejected responses) without an explicit reward model

ROUGE-L: A metric measuring the longest common subsequence between two texts, used here to detect verbatim overlap

unlearning: Techniques designed to make a model 'forget' specific subsets of training data

Tulu: An instruction-tuned model family based on Llama, used here as a base for experiments

The Pile: A large-scale, diverse dataset often used for training LLMs

infini-gram: A tool for computing n-gram overlap against extremely large corpora (such as The Pile), used to measure memorization

system prompt: A high-level instruction given to the model (e.g., 'You are a helpful assistant') that governs its behavior for the interaction