Characterizing Memorization in Diffusion Language Models: Generalized Extraction and Sampling Effects

📝 Paper Summary

Language Model Memorization, Diffusion Language Models (DLMs), Privacy and Copyright

The paper formulates a generalized probabilistic extraction framework for Diffusion Language Models (DLMs), proves that increasing sampling resolution increases verbatim memorization, and finds empirically that DLMs nonetheless exhibit lower PII leakage than scale-matched autoregressive models.

Core Problem

Standard memorization metrics rely on prefix-based decoding suited for autoregressive models (ARMs), but Diffusion Language Models (DLMs) generate via bidirectional, non-causal denoising, making existing metrics inapplicable.

Why it matters:

Memorization of training data leads to privacy leakage (PII) and copyright infringement risks in deployed models.
DLMs are emerging as competitive alternatives to ARMs, but their unique generation dynamics make their privacy risks largely uncharacterized.
Without a formal definition for DLM memorization, it is impossible to audit these models for data leakage or compare them fairly against ARMs.

Concrete Example: In an ARM, memorization is tested by providing a prefix such as 'My email is...' and checking whether the generated suffix matches the training data. A DLM instead generates by gradually denoising a sequence that is fully masked or masked at a random subset of positions. There is no fixed 'prefix' order, so the standard test fails to capture how the model exposes data.

Key Novelty

Generalized Probabilistic Extraction Framework for DLMs

Generalizes the definition of 'discoverable extraction' to handle arbitrary masking patterns and stochastic sampling trajectories, rather than just left-to-right generation.
Establishes a theoretical link between 'sampling resolution' (number of denoising steps) and memorization: recovering tokens in finer steps increases the chance of exactly reproducing training data.
Proposes that autoregressive decoding is a special limiting case of diffusion generation in which the sampling resolution is maximal (one token per step).

Architecture

A visual example of PII (email header) memorization in a diffusion model (LLaDA-8B), contrasting masked inputs with the generated output.

Evaluation Highlights

Diffusion models (DLMs) show substantially lower PII leakage than ARMs: 0 exact email extractions for DLM-1.1B vs 213 for ARM-1.1B under aligned settings (p=0.99).
Increasing sampling resolution strictly increases memorization: LLaDA-8B extraction counts rose from 9 (one-step) to 179 (max-resolution) for emails (p=0.50).
Metric validation confirms training data (Enron) has consistently higher reconstruction likelihood than disjoint test data (TREC Spam), verifying the metric measures memorization, not just generalization.

Breakthrough Assessment

7/10

Establishes the first formal framework for measuring memorization in DLMs and provides a strong theoretical link between sampling steps and privacy risk. Empirical results are solid, though limited to smaller scales (1.1B/8B) compared to SOTA ARMs.

⚙️ Technical Details

Problem Definition

Setting: Quantifying the probability that a generative model produces a specific training sequence exactly (verbatim) or approximately given a set of observed tokens.

Inputs: A set of observed tokens z_M_bar (context) and a set of masked positions M.

Outputs: A reconstructed sequence z_hat_M over n independent queries.
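
If the n queries are independent and each one recovers the masked tokens exactly with probability q, the chance of at least one success is 1 - (1 - q)^n. The sketch below applies that identity to the (n, p) criterion. The function names and the symbol q are illustrative assumptions, not the paper's notation.

```python
def extraction_probability(q: float, n: int) -> float:
    """Probability of at least one exact recovery in n independent queries,
    given a single-query success probability q (assumed known or estimated)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - q) ** n


def is_discoverably_extractable(q: float, n: int, p: float) -> bool:
    """True if the sequence meets the (n, p) criterion under the
    independence assumption stated above."""
    return extraction_probability(q, n) >= p
```

Under this reading, a sequence with a small single-query probability can still meet the criterion when n is large, which is why n and p are reported together.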

Pipeline Flow

Masked Input Construction (apply arbitrary mask M to sequence)
Iterative Denoising (perform N steps of reverse diffusion)
Extraction Verification (check if generated tokens match ground truth)
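
The three stages above can be sketched with a toy setup. This is a minimal illustration, not the paper's implementation: `toy_predict` is a hypothetical stand-in for a DLM that has memorized a made-up sequence, and filling positions left to right in equal chunks is an assumption chosen for simplicity.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Set

Token = Optional[str]  # None stands in for a [MASK] token in this toy setup
Predictor = Callable[[List[Token], List[int]], Dict[int, str]]


def apply_mask(tokens: Sequence[str], masked_positions: Set[int]) -> List[Token]:
    """Stage 1: replace the tokens at masked_positions with the mask marker."""
    return [None if i in masked_positions else tok for i, tok in enumerate(tokens)]


def denoise(masked: Sequence[Token], predict: Predictor, num_steps: int) -> List[Token]:
    """Stage 2: fill masked positions over at most num_steps rounds.
    Each round asks the predictor for one chunk, conditioned on the
    partially filled sequence."""
    seq = list(masked)
    remaining = [i for i, tok in enumerate(seq) if tok is None]
    if not remaining:
        return seq
    per_step = math.ceil(len(remaining) / num_steps)
    while remaining:
        chunk, remaining = remaining[:per_step], remaining[per_step:]
        predictions = predict(seq, chunk)
        for pos in chunk:
            seq[pos] = predictions[pos]
    return seq


def is_exact_extraction(generated: Sequence[Token], target: Sequence[str],
                        masked_positions: Set[int]) -> bool:
    """Stage 3: verbatim check restricted to the masked positions."""
    return all(generated[i] == target[i] for i in masked_positions)


# Made-up sequence and a toy predictor that has "memorized" it.
TARGET = ["my", "email", "is", "alice", "@", "example.com"]
MASKED_POSITIONS = {3, 5}


def toy_predict(seq: List[Token], positions: List[int]) -> Dict[int, str]:
    return {i: TARGET[i] for i in positions}


masked_input = apply_mask(TARGET, MASKED_POSITIONS)
generated = denoise(masked_input, toy_predict, num_steps=2)
extracted = is_exact_extraction(generated, TARGET, MASKED_POSITIONS)
```

A real evaluation would replace `toy_predict` with stochastic sampling from the model and repeat the run over many queries.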

System Modules

Generalized Extraction Evaluator

Calculates the probability of recovering masked tokens z_M given context z_M_bar

Model or implementation: Evaluated on LLaDA-8B and custom trained DLMs

Novel Architectural Elements

Integration of sampling resolution (N) as a variable in the memorization definition, treating ARMs as the N=|M| limit case of DLMs.
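
To make the N = |M| limit concrete, the sketch below computes how many tokens are revealed at each step for a given resolution. Splitting the masked tokens as evenly as possible is an assumption for illustration; the summary does not specify the paper's unmasking schedule.

```python
from typing import List


def unmask_schedule(num_masked: int, num_steps: int) -> List[int]:
    """Number of tokens revealed at each of num_steps steps, splitting
    num_masked tokens as evenly as possible (earlier steps take the remainder)."""
    if not 1 <= num_steps <= num_masked:
        raise ValueError("need 1 <= num_steps <= num_masked")
    base, extra = divmod(num_masked, num_steps)
    return [base + 1 if i < extra else base for i in range(num_steps)]


one_step = unmask_schedule(10, 1)      # all tokens revealed at once
coarse = unmask_schedule(10, 3)        # a few tokens per step
arm_like = unmask_schedule(10, 10)     # one token per step, as in ARM decoding
```

With `num_steps` equal to the number of masked tokens, every step reveals exactly one token, which is the regime the summary identifies with autoregressive decoding.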

Modeling

Base Model: LLaDA-8B (scaled DLM) and custom trained DLMs (170M, 690M, 1.1B)

Training Method: Pretraining followed by SFT (Supervised Fine-Tuning)

Objective Functions:

Purpose: Approximate the true data distribution by minimizing the negative log-likelihood upper bound.

Formally: $\mathcal{L} = \int \mathbb{E}\left[\sum -\log p(z_0 \mid z_t)\right] \, dt$

Adaptation: Fine-tuning on Enron email dataset for 1 epoch

Training Data:

Pretraining: SlimPajama
Fine-tuning/Evaluation: Enron email dataset (PII source)
Control/Test: TREC 2007 Spam corpus

Key Hyperparameters:

compute_budget: 10^21 FLOPs (for custom 1.1B models)
sampling_resolution: Varied (1, 2, 5, 10, |M|)
mask_ratios: 0.20, 0.25, 0.30

Compute: Custom models trained on LUMI cluster; specific GPU hours not reported in the paper

Comparison to Prior Work

vs. Hayes et al.: Generalizes extraction to arbitrary masking patterns and step counts; Hayes et al. is specific to prefix-based ARMs.
vs. Carlini et al.: Applies to non-causal diffusion generation; Carlini et al. focuses on causal ARMs.

Limitations

Validation limited to text-based diffusion; does not cover continuous diffusion.
Empirical verification done primarily on 1.1B and 8B scales; behavior at 100B+ scale not tested.
Requires approximation of single-trial recovery probability due to stochastic trajectories.
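
One simple way to approximate a single-trial recovery probability is a Monte Carlo estimate: run many independent trials and report the success fraction. The sketch below shows that idea only. It is not necessarily the estimator used in the paper, and `sample_once` is a hypothetical callable standing in for one stochastic denoising run followed by an exact-match check.

```python
import random
from typing import Callable


def estimate_recovery_probability(
    sample_once: Callable[[random.Random], bool],
    num_trials: int,
    seed: int = 0,
) -> float:
    """Fraction of num_trials independent trials in which sample_once
    reports an exact recovery."""
    if num_trials < 1:
        raise ValueError("num_trials must be at least 1")
    rng = random.Random(seed)  # seeded for reproducibility
    successes = sum(1 for _ in range(num_trials) if sample_once(rng))
    return successes / num_trials


# Toy trial that succeeds with probability 0.3 (a made-up value).
toy_estimate = estimate_recovery_probability(
    lambda rng: rng.random() < 0.3, num_trials=10_000
)
```

The estimate's standard error shrinks roughly as one over the square root of the number of trials, so very small probabilities need many trials to resolve.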

Reproducibility

The paper text does not state whether code is available. The experiments use public datasets (SlimPajama, Enron, TREC 2007) and a publicly described architecture (LLaDA).

📊 Experiments & Results

Evaluation Setup

Exact and approximate reconstruction of training data (SlimPajama/Enron) given masked inputs.

Benchmarks:

Enron Email Dataset: PII memorization (emails and phone numbers)
TREC 2007 Spam: control dataset (unseen data)

Metrics:

(n, p)-discoverable extraction count
Exact-recovery success rate
Hamming distance (for relaxed memorization)
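
For the relaxed metric, a generated sequence can count as a match when it differs from the target in at most a few positions. The sketch below shows the distance and a thresholded check; `max_distance` is a hypothetical tolerance, since the summary does not state the value the paper uses.

```python
from typing import Sequence


def hamming_distance(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_relaxed_match(generated: Sequence, target: Sequence, max_distance: int) -> bool:
    """True if generated is within max_distance token mismatches of target.
    max_distance = 0 reduces to exact (verbatim) recovery."""
    return hamming_distance(generated, target) <= max_distance
```

Hamming distance suits this setting because the masked positions are fixed, so generated and target sequences always have the same length.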

Key Results

Comparative analysis of PII memorization between Diffusion (DLM) and Autoregressive (ARM) models under aligned settings.
Benchmark	Metric	ARM (baseline)	DLM	Δ
Enron (Emails)	Extraction Count (p=0.99)	213	0	-213
Enron (Phone Numbers)	Extraction Count (p=0.50)	5	0	-5

Impact of sampling resolution (number of steps) on memorization rates.
Benchmark	Metric	One-step	Max-resolution	Δ
Enron (Emails)	Extraction Count (p=0.50)	9	179	+170
Enron (Phone Numbers)	Extraction Count (p=0.50)	7	23	+16

Experiment Figures

The empirical relationship between sampling steps and exact-recovery probability.

Comparison of reconstruction likelihood distributions for Training Data (Enron) vs. Test Data (TREC Spam).

Main Takeaways

Increasing sampling resolution (more denoising steps) monotonically increases the probability of exact training data extraction.
Under aligned conditions (same pre-training data and compute), Diffusion Language Models exhibit significantly lower PII leakage than Autoregressive Models.
Autoregressive decoding can be theoretically viewed as a 'worst-case' limit of diffusion generation (maximum resolution) regarding privacy leakage.
The proposed metric distinguishes true memorization from generalization, as evidenced by the gap between reconstruction likelihoods on training data (Enron) vs. unseen data (TREC).

📚 Prerequisite Knowledge

Prerequisites

Autoregressive Language Models (ARMs) vs. Masked Diffusion Models (DLMs)
Denoising diffusion probabilistic models (forward/reverse processes)
Memorization definitions (k-eidetic, discoverable extraction)

Key Terms

DLM: Diffusion Language Model, a generative model that creates text by iteratively denoising a masked sequence rather than predicting the next token sequentially.

ARM: Autoregressive Model, a standard language model (like GPT) that generates text one token at a time from left to right.

(n, p)-discoverable extraction: A metric defining a sequence as memorized if it can be generated exactly within 'n' attempts with probability at least 'p'.

sampling resolution: The number of steps used in the diffusion reverse process to convert noise into text; fewer steps are faster but coarser, more steps are finer-grained.

PII: Personally Identifiable Information, sensitive data such as email addresses or phone numbers.

mask token: A special token (e.g., [MASK]) used to replace original tokens during the forward diffusion process, which the model learns to predict.

Hamming distance: A metric measuring the number of positions at which the corresponding symbols in two sequences are different.