
GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang, Jiawei Zhou, Hanjie Chen
Rice University, Stony Brook University
arXiv (2026)

📝 Paper Summary

Safety Alignment · Continual Learning · Synthetic Data Generation
GR-SAP preserves model safety during downstream fine-tuning by mixing in synthetic safety data generated by the model itself, which acts as a reliable proxy for inaccessible original alignment data.
Core Problem
Fine-tuning LLMs on downstream tasks degrades safety alignment (catastrophic forgetting), and preserving it is difficult because original alignment data is rarely public.
Why it matters:
  • Seemingly benign fine-tuning on math or reasoning tasks can unintentionally break safety guardrails, causing models to answer harmful queries
  • Open-source safety datasets often have different distributions than the model's original training data, leading to ineffective protection or even further degradation
  • Reliable safety preservation is critical for domain adaptation of open-weight models where original data is proprietary
Concrete Example: When Llama-3-8B-Instruct is fine-tuned on GSM8K (math), its refusal rate on the WildJailbreak benchmark drops sharply: the ratio of harmful responses rises from 10.5% to 22.88%. Simply mixing in external safety data such as BeaverTails fails to fix this and can even spike harmfulness to 31.60%.
Key Novelty
Generative Replay for Safety Alignment Preservation (GR-SAP)
  • Treats the LLM as its own safety data generator: the model synthesizes safety queries and responses which approximate the original, undisclosed alignment distribution
  • Uses a 'revise-and-include' strategy: intentionally includes originally unsafe responses that have been corrected by a guardrail, treating them as high-value 'difficult' training examples
  • Theoretically bounds the safety gap by decomposing the divergence between synthetic and original data into query shift and alignment residual
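The replay-and-revise loop described in the bullets above can be sketched as follows. This is a minimal illustration, not the paper's implementation: every name here (build_replay_mix, generate_query, generate_response, is_safe, revise) is a hypothetical stand-in for the model's sampling interface and the guardrail.

```python
import random

def build_replay_mix(generate_query, generate_response, is_safe, revise,
                     downstream_data, n_safety, seed=0):
    """Synthesize safety examples from the model itself and mix them
    into the downstream fine-tuning set (GR-SAP-style sketch).

    generate_query() -> str      : model samples a safety-relevant query
    generate_response(q) -> str  : model answers its own query
    is_safe(q, r) -> bool        : guardrail safety check
    revise(q, r) -> str          : guardrail-corrected safe response
    """
    rng = random.Random(seed)
    safety_examples = []
    for _ in range(n_safety):
        q = generate_query()
        r = generate_response(q)
        if is_safe(q, r):
            safety_examples.append({"query": q, "response": r, "revised": False})
        else:
            # revise-and-include: instead of discarding unsafe generations,
            # keep the guardrail-corrected version as a high-value
            # "difficult" training example
            safety_examples.append(
                {"query": q, "response": revise(q, r), "revised": True})
    # Mix synthetic safety data with the downstream task data and shuffle,
    # so safety replay is interleaved throughout fine-tuning
    mixed = list(downstream_data) + safety_examples
    rng.shuffle(mixed)
    return mixed
```

In practice the safety examples would be chat-formatted and the mixing ratio tuned, but the structure — self-generate, guardrail-check, revise rather than drop, then mix — is the core of the method as summarized here.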
Evaluation Highlights
  • Reduces harmful response ratio on Llama-3-8B-Instruct from 6.28% (unmixed baseline) to 0.58% after fine-tuning, while maintaining downstream accuracy
  • Prevents safety degradation on WildJailbreak: where unmixed training spikes to >20% harmfulness, GR-SAP maintains <1% harmfulness throughout training
  • Outperforms open-source safety datasets (e.g., BeaverTails), which can catastrophically degrade safety (spiking Llama3 harmfulness to 31.60%) due to distribution mismatch
Breakthrough Assessment
8/10
Offers a practical, theoretically grounded solution to a widespread problem (safety forgetting) without requiring access to proprietary data. The finding that self-generated data outperforms external safety datasets is significant.