
GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang, Jiawei Zhou, Hanjie Chen
Rice University, Stony Brook University
arXiv (2026)

📝 Paper Summary

Safety Alignment · Continual Learning · Synthetic Data Generation
GR-SAP preserves model safety during downstream fine-tuning by mixing in synthetic safety data generated by the model itself, which acts as a reliable proxy for inaccessible original alignment data.
Core Problem
Fine-tuning LLMs on downstream tasks degrades safety alignment (catastrophic forgetting), and preserving it is difficult because original alignment data is rarely public.
Why it matters:
  • Seemingly benign fine-tuning on math or reasoning tasks can unintentionally break safety guardrails, causing models to answer harmful queries
  • Open-source safety datasets often have different distributions than the model's original training data, leading to ineffective protection or even further degradation
  • Reliable safety preservation is critical for domain adaptation of open-weight models where original data is proprietary
Concrete Example: When Llama-3-8B-Instruct is fine-tuned on GSM8K (math), its refusal rate on the WildJailbreak benchmark drops sharply: the ratio of harmful responses rises from 10.5% to 22.88%. Simply mixing in external safety data such as BeaverTails fails to fix this and can even spike harmfulness to 31.60%.
Key Novelty
Generative Replay for Safety Alignment Preservation (GR-SAP)
  • Treats the LLM as its own safety data generator: the model synthesizes safety queries and responses which approximate the original, undisclosed alignment distribution
  • Uses a 'revise-and-include' strategy: intentionally includes originally unsafe responses that have been corrected by a guardrail, treating them as high-value 'difficult' training examples
  • Theoretically bounds the safety gap by decomposing the divergence between synthetic and original data into query shift and alignment residual
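The replay-and-revise loop described in the bullets above can be sketched as follows. This is a minimal illustration, not the paper's implementation: every name here (build_replay_mix, generate_query, generate_response, is_safe, revise) is a hypothetical stand-in for the model's sampling interface and the guardrail.

```python
import random

def build_replay_mix(generate_query, generate_response, is_safe, revise,
                     downstream_data, n_safety, seed=0):
    """Synthesize safety examples from the model itself and mix them
    into the downstream fine-tuning set (GR-SAP-style sketch).

    generate_query() -> str      : model samples a safety-relevant query
    generate_response(q) -> str  : model answers its own query
    is_safe(q, r) -> bool        : guardrail safety check
    revise(q, r) -> str          : guardrail-corrected safe response
    """
    rng = random.Random(seed)
    safety_examples = []
    for _ in range(n_safety):
        q = generate_query()
        r = generate_response(q)
        if is_safe(q, r):
            safety_examples.append({"query": q, "response": r, "revised": False})
        else:
            # revise-and-include: instead of discarding unsafe generations,
            # keep the guardrail-corrected version as a high-value
            # "difficult" training example
            safety_examples.append(
                {"query": q, "response": revise(q, r), "revised": True})
    # Mix synthetic safety data with the downstream task data and shuffle,
    # so safety replay is interleaved throughout fine-tuning
    mixed = list(downstream_data) + safety_examples
    rng.shuffle(mixed)
    return mixed
```

In practice the safety examples would be chat-formatted and the mixing ratio tuned, but the structure — self-generate, guardrail-check, revise rather than drop, then mix — is the core of the method as summarized here.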
Evaluation Highlights
  • Reduces harmful response ratio on Llama-3-8B-Instruct from 6.28% (unmixed baseline) to 0.58% after fine-tuning, while maintaining downstream accuracy
  • Prevents safety degradation on WildJailbreak: where unmixed training spikes to >20% harmfulness, GR-SAP maintains <1% harmfulness throughout training
  • Outperforms open-source safety datasets (e.g., BeaverTails), which can catastrophically degrade safety (spiking Llama3 harmfulness to 31.60%) due to distribution mismatch
Breakthrough Assessment
8/10
Offers a practical, theoretically grounded solution to a widespread problem (safety forgetting) without requiring access to proprietary data. The finding that self-generated data outperforms external safety datasets is significant.