Private Fine-tuning of Large Language Models with Zeroth-order Optimization

📝 Paper Summary

Differentially Private Machine Learning, Parameter-Efficient Fine-tuning (PEFT), Large Language Model Optimization

DP-ZO enables private fine-tuning of large language models by privatizing the scalar loss difference from zeroth-order optimization steps, avoiding the heavy memory cost of per-sample gradient clipping.

Core Problem

Standard Differentially Private Stochastic Gradient Descent (DP-SGD) requires per-example gradient clipping, which incurs massive memory overheads and engineering complexity when scaling to large foundation models.

Why it matters:

Scaling DP training to models like OPT-66B is computationally prohibitive with DP-SGD due to memory constraints.
Existing private training methods often struggle to balance privacy guarantees with utility, particularly under strict privacy budgets (pure epsilon-DP).
Pretrained checkpoints are a valuable resource, but fine-tuning them on private data remains a bottleneck due to the hardware requirements of backpropagation-based DP methods.

Concrete Example: When fine-tuning an OPT-66B model, DP-SGD would require storing and clipping gradients for each sample in a batch, likely causing Out-Of-Memory errors on standard GPUs. DP-ZO avoids this by only needing forward passes and privatizing a single scalar value per step.

Key Novelty

Differentially Private Zeroth-Order Optimization (DP-ZO)

Instead of calculating gradients via backpropagation, DP-ZO estimates the update direction using random perturbations and the difference in loss values (a scalar).
Privacy is achieved by adding noise to this scalar loss difference, rather than to a high-dimensional gradient vector, circumventing the curse of dimensionality.
Because the privatized quantity is a scalar with bounded sensitivity, DP-ZO can use the Laplace mechanism to obtain pure epsilon-DP, which is typically infeasible for high-dimensional DP-SGD.
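
A rough sensitivity calculation illustrates why a scalar makes the Laplace mechanism practical. The sketch below is an illustration under stated assumptions, not the paper's analysis: it uses a single noisy release, and it relies on the worst-case bound that a vector clipped to L2 norm C can have L1 norm up to C * sqrt(d). The function name and the example values are hypothetical.

```python
import math

def laplace_scale(l2_clip: float, dim: int, epsilon: float) -> float:
    """Per-coordinate Laplace scale b = (worst-case L1 sensitivity) / epsilon.

    Assumes one example contributes a vector with L2 norm <= l2_clip, so its
    L1 norm is at most l2_clip * sqrt(dim) (illustrative worst-case bound).
    """
    l1_sensitivity = l2_clip * math.sqrt(dim)
    return l1_sensitivity / epsilon

# Clipped scalar loss difference (dim = 1) versus a hypothetical
# one-million-dimensional clipped gradient, both at epsilon = 4.
scalar_b = laplace_scale(l2_clip=1.0, dim=1, epsilon=4.0)
vector_b = laplace_scale(l2_clip=1.0, dim=10**6, epsilon=4.0)
print(scalar_b, vector_b)
```

Under these assumptions the noise scale for the scalar does not depend on the number of model parameters, while the scale for the vector grows with sqrt(d). Composition of the privacy cost over many training steps is not modeled here.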

Architecture

Comparison of DP-SGD and DP-ZO update mechanisms.

Evaluation Highlights

Achieves 93.5% of non-private performance on RoBERTa-large for SST-2 with (epsilon=8, delta=1e-5)-DP.
Outperforms DP-SGD on OPT-13B (SQuAD task) with a score of 81.3 vs 80.8, while using significantly less memory.
First method to provide non-trivial utility (73.52 on SQuAD) under pure epsilon-DP (epsilon=4) for large models using the Laplace mechanism.

Breakthrough Assessment

8/10

Significant for enabling private training on very large models where DP-SGD fails due to memory. The ability to use pure epsilon-DP effectively is a strong theoretical and practical contribution.

⚙️ Technical Details

Problem Definition

Setting: Differentially private fine-tuning of pre-trained Large Language Models (LLMs) on downstream tasks.

Inputs: Private dataset D, Pretrained Model Parameters theta, Loss function L.

Outputs: Privately fine-tuned Model Parameters theta*.

Pipeline Flow

Sample perturbation z
Forward pass at theta + phi*z
Forward pass at theta - phi*z
Compute loss difference scalar
Clip and Privatize scalar
Update Model
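
The six steps above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: a two-parameter linear model with squared error stands in for the LLM forward pass, and the helper names (`example_loss`, `mean_loss`, `dp_zo_step`), the per-example clipping of the loss difference, and the 1/(2 * phi * n) normalization are assumptions made for this sketch.

```python
import random

def example_loss(theta, x, y):
    """Toy squared error of a linear model; stands in for one LLM forward pass."""
    pred = sum(t * xi for t, xi in zip(theta, x))
    return (pred - y) ** 2

def mean_loss(theta, batch):
    return sum(example_loss(theta, x, y) for x, y in batch) / len(batch)

def dp_zo_step(theta, batch, phi, clip_c, noise_std, lr, rng):
    """One step of the pipeline (hypothetical helper, not the paper's code)."""
    # Sample perturbation z ~ N(0, I).
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    theta_plus = [t + phi * zi for t, zi in zip(theta, z)]
    theta_minus = [t - phi * zi for t, zi in zip(theta, z)]
    clipped_sum = 0.0
    for x, y in batch:
        # Two forward passes per example give one scalar loss difference.
        diff = example_loss(theta_plus, x, y) - example_loss(theta_minus, x, y)
        # Clip the scalar to [-clip_c, clip_c] to bound each example's influence.
        clipped_sum += max(-clip_c, min(clip_c, diff))
    # Privatize: add Gaussian noise to the clipped sum, which is a single scalar.
    noisy_sum = clipped_sum + rng.gauss(0.0, noise_std)
    # Update along z, scaled by the noisy average directional derivative.
    step = lr * noisy_sum / (2.0 * phi * len(batch))
    return [t - step * zi for t, zi in zip(theta, z)]

rng = random.Random(0)
batch = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
theta = [0.0, 0.0]
initial = mean_loss(theta, batch)
for _ in range(200):
    # noise_std=0.0 and a huge clip_c disable privacy so the toy run is easy to inspect.
    theta = dp_zo_step(theta, batch, phi=1e-3, clip_c=1e9,
                       noise_std=0.0, lr=0.01, rng=rng)
final = mean_loss(theta, batch)
print(initial, final)
```

Only forward evaluations of the loss appear in the step, and the only quantity that receives noise is `clipped_sum`. Setting `noise_std` above zero and `clip_c` to a task-appropriate value turns the toy run into a noisy one; choosing those values to meet a target privacy budget requires a privacy accountant, which is not shown.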

System Modules

Perturbation Sampler (Optimization Step)

Generate random perturbation vector z from standard Gaussian distribution

Model or implementation: Standard Gaussian N(0, I)

Loss Evaluator (Optimization Step)

Compute loss at perturbed parameters

Model or implementation: RoBERTa-large / OPT (1.3B - 66B)

Private Estimator

Compute clipped loss difference and add noise

Model or implementation: Laplace or Gaussian Mechanism
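
The private estimator can be sketched as a small function that clips, sums, and adds noise from either mechanism. This is an illustration under assumptions: `privatize`, `laplace_noise`, and the `noise_param` argument are hypothetical names, the Laplace scale `clip_c / epsilon` treats epsilon as a per-step budget, and the Gaussian standard deviation `sigma * clip_c` uses a noise multiplier that a privacy accountant would normally choose.

```python
import random

def clip(value, clip_c):
    """Clamp a scalar to [-clip_c, clip_c]."""
    return max(-clip_c, min(clip_c, value))

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize(loss_diffs, clip_c, mechanism, noise_param, rng):
    """Clip per-example loss differences, sum them, and add noise to the sum.

    noise_param is an assumption of this sketch: a per-step epsilon for
    "laplace" (scale = clip_c / epsilon) or a noise multiplier sigma for
    "gaussian" (std = sigma * clip_c). Accounting across steps is not shown.
    """
    clipped_sum = sum(clip(d, clip_c) for d in loss_diffs)
    if mechanism == "laplace":
        return clipped_sum + laplace_noise(clip_c / noise_param, rng)
    if mechanism == "gaussian":
        return clipped_sum + rng.gauss(0.0, noise_param * clip_c)
    raise ValueError(f"unknown mechanism: {mechanism}")

rng = random.Random(0)
diffs = [0.02, -0.5, 3.0]  # hypothetical per-example loss differences
print(privatize(diffs, clip_c=0.1, mechanism="laplace", noise_param=1.0, rng=rng))
print(privatize(diffs, clip_c=0.1, mechanism="gaussian", noise_param=1.0, rng=rng))
```

The Laplace branch corresponds to the pure epsilon-DP setting and the Gaussian branch to the (epsilon, delta)-DP setting; everything else in the estimator is shared.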

Novel Architectural Elements

Application of DP noise to the scalar loss difference in SPSA (Simultaneous Perturbation Stochastic Approximation) rather than the high-dimensional gradient.
Integration of Laplace mechanism for pure DP in large-scale model training.

Modeling

Base Model: RoBERTa-large (355M), OPT-1.3B, OPT-2.7B, OPT-6.7B, OPT-13B, OPT-30B, OPT-66B

Training Method: DP-ZO (Differentially Private Zeroth-Order Optimization) with LoRA (Low-Rank Adaptation)

Objective Functions:

Purpose: Estimate gradient direction.

Formally: g_est = (L(theta + phi*z) - L(theta - phi*z)) / (2*phi) * z

Purpose: Privatize update.

Formally: Clip the scalar loss difference L_diff = L(theta + phi*z) - L(theta - phi*z) to [-C, C], then add noise with scale proportional to the sensitivity C.
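
Written out, the two objectives can be combined as follows. This is a sketch under assumptions: the per-example differences \Delta_i, the batch size n, the noise variable \eta, and the 1/n normalization are notation introduced here for illustration and may differ from the paper's exact formulation.

```latex
\begin{aligned}
\Delta_i &= L(\theta + \phi z;\, x_i) - L(\theta - \phi z;\, x_i),
  \qquad z \sim \mathcal{N}(0, I) \\
s &= \sum_{i=1}^{n} \operatorname{clip}(\Delta_i, C) + \eta,
  \qquad \operatorname{clip}(\Delta, C) = \max(-C, \min(C, \Delta)) \\
\tilde{g} &= \frac{s}{2 \phi n}\, z \\[4pt]
\frac{L(\theta + \phi z) - L(\theta - \phi z)}{2\phi}
  &= z^{\top} \nabla L(\theta) + O(\phi^{2}),
  \qquad \mathbb{E}\!\left[ z z^{\top} \right] = I
\end{aligned}
```

Without clipping and noise, \tilde{g} reduces to g_est for the batch-average loss. The last line is the standard Taylor-expansion argument: the central difference recovers the directional derivative along z, and because z has identity covariance, the estimate equals the gradient in expectation up to O(\phi^2) terms.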

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: Varies (LoRA parameters only)
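
Because only the LoRA parameters are trained, it is reasonable to assume that the perturbation z and the update cover only those parameters. The sketch below counts them for a single weight matrix; the layer shape and rank are hypothetical values chosen for illustration, not settings reported in the paper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

# Hypothetical layer shape and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)
print(full, lora, lora / full)
```

For this hypothetical layer the low-rank factors hold 1/256 of the parameters of the full matrix, which is the dimensionality the zeroth-order estimator has to search over.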

Training Data:

SST-2 (GLUE)
MNLI (GLUE)
QNLI (GLUE)
QQP (GLUE)
SQuAD (Question Answering)

Key Hyperparameters:

perturbation_scale_phi: Calculated as epsilon_var / ||m|| (varies)
learning_rate: Not explicitly listed as a single fixed value (tuned per task)
batch_size: Typically 2048 or 4096 (effective)
clipping_threshold_C: Varies by task (tuned)
privacy_epsilon: Standard settings include 0.5, 1, 2, 4, 8

Compute: Single A100 GPU (80GB) used for experiments. DP-ZO reported to use 2x-4x less memory than DP-SGD.

Comparison to Prior Work

vs. DP-SGD: DP-ZO avoids backpropagation and per-example gradient clipping, reducing memory usage significantly.
vs. MeZO: DP-ZO adds clipping, aggregation, and noise on the loss difference to provide formal DP guarantees.
vs. DPZero: DP-ZO supports pure epsilon-DP via Laplace mechanism; DPZero uses unit sphere perturbation while DP-ZO uses Gaussian perturbation.

Limitations

Zeroth-order optimization generally has slower convergence rates dependent on problem dimensionality (though effective rank may mitigate this).
Inference-only API privacy relies on the provider not logging queries, which this method does not enforce technically.
Performance gap compared to non-private full fine-tuning still exists, particularly for smaller models or tighter privacy budgets.

Reproducibility

Code: https://github.com/princeton-nlp/DP-ZO

The paper uses standard datasets (GLUE, SQuAD) and open-source models (RoBERTa, OPT). Hyperparameters for specific results are detailed in the appendix.

📊 Experiments & Results

Evaluation Setup

Fine-tuning pretrained models on classification and QA tasks under Differential Privacy constraints.

Benchmarks:

SST-2 (Sentiment Analysis)
MNLI (Natural Language Inference)
SQuAD (Question Answering)

Metrics:

Accuracy
F1 score (for SQuAD)
Statistical methodology: Not explicitly reported in the paper

Key Results

DP-ZO matches or exceeds DP-SGD performance on larger models while maintaining efficiency.

Benchmark	Metric	Baseline	This Paper	Δ
SST-2	Accuracy	93.3	93.5	+0.2
SQuAD	F1	80.8	81.3	+0.5
SQuAD	F1	89.0	85.8	-3.2
SQuAD	F1	10.0	73.5	+63.5

Experiment Figures

Performance of DP-ZO vs DP-SGD across different model sizes (1.3B to 66B) on SQuAD.

Main Takeaways

DP-ZO scales effectively to very large models (up to OPT-66B), where DP-SGD is often memory-constrained.
The method provides a strong privacy-utility trade-off, often matching or exceeding DP-SGD for large models.
DP-ZO is uniquely capable of effective training under pure epsilon-DP using the Laplace mechanism, unlike gradient-based methods.
The utility gap between private and non-private training decreases as model size increases.

📚 Prerequisite Knowledge

Prerequisites

Differential Privacy (DP) definitions (epsilon, delta)
Stochastic Gradient Descent (SGD)
Zeroth-Order Optimization (ZO) / SPSA
Parameter-Efficient Fine-Tuning (e.g., LoRA)

Key Terms

DP-SGD: Differentially Private Stochastic Gradient Descent—the standard algorithm for private training that adds noise to clipped gradients.

Zeroth-Order Optimization: Optimization methods that estimate gradients using only function values (losses) rather than explicit derivatives.

SPSA: Simultaneous Perturbation Stochastic Approximation—a specific zeroth-order method that estimates gradients using random perturbations.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains rank-decomposition matrices.

Epsilon-DP: Pure Differential Privacy, a stronger guarantee where the privacy loss is strictly bounded by epsilon (delta=0).

Laplace Mechanism: A DP mechanism that adds noise drawn from a Laplace distribution, typically used for epsilon-DP.

Gaussian Mechanism: A DP mechanism that adds noise drawn from a Gaussian distribution, typically used for (epsilon, delta)-DP.

Per-example gradient clipping: The process in DP-SGD of scaling down individual sample gradients to bound their norm (sensitivity) before aggregation.

Sensitivity: The maximum amount by which a single individual's data can change the function output (in this case, the loss difference scalar).

Membership Inference Attack: An attack that attempts to determine whether a specific data point was used to train a machine learning model.
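
For reference, the (epsilon, delta)-DP guarantee and the Laplace mechanism named in the terms above have the following standard textbook forms. The symbols \mathcal{M} (the randomized mechanism), f (the scalar function being released), and \Delta_1 f (its L1 sensitivity) are notation introduced here, not taken from the paper.

```latex
\begin{aligned}
&\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon}\, \Pr[\mathcal{M}(D') \in S] + \delta
  \quad \text{for all adjacent } D, D' \text{ and all measurable } S, \\
&\mathcal{M}_{\mathrm{Lap}}(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta_1 f}{\varepsilon}\right),
  \qquad \Delta_1 f = \max_{\text{adjacent } D, D'} \lvert f(D) - f(D') \rvert .
\end{aligned}
```

Pure epsilon-DP is the case delta = 0, and the Laplace mechanism in the second line satisfies it for a single release of f.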