
Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models

Pin-Yu Chen, Han Shen, Payel Das, Tianyi Chen
IBM Research, Rensselaer Polytechnic Institute
arXiv.org (2025)

📝 Paper Summary

Topics: LLM Safety Alignment · Supervised Fine-Tuning (SFT) · Theoretical Analysis of Deep Learning
This paper establishes a theoretical framework quantifying how data similarity, context overlap, and loss landscape geometry dictate the inevitable trade-off between safety preservation and capability improvement during LLM fine-tuning.
Core Problem
Fine-tuning aligned LLMs on downstream tasks inevitably degrades their safety guardrails (the safety-capability trade-off), but the field lacks a theoretical understanding of the fundamental limits governing this degradation.
Why it matters:
  • Enhancing model capability on specific tasks often breaks innate safety protections even when the fine-tuning data contains no malicious examples
  • Current mitigation strategies are empirical; without theoretical bounds, it is unclear how much safety must be sacrificed for a given capability gain
  • Understanding the impact of data distribution mismatch is critical for selecting appropriate proxy safety datasets when original alignment data is unavailable
Concrete Example: An LLM aligned on the Orca dataset to refuse harmful queries might lose this refusal ability after being fine-tuned on the Alpaca dataset to improve instruction following. The paper shows this degradation is worse when the fine-tuning data (Alpaca) has high context overlap with the safety data (Orca) but divergent target outputs.
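The role of context overlap can be illustrated with a toy similarity measure. This sketch does not reproduce the paper's formal definition of context overlap; it uses a crude bag-of-words cosine similarity between instruction corpora (all example prompts below are invented) purely to show why general instruction data tends to share more surface context with safety data than code data does.

```python
from collections import Counter
import math

def cosine_overlap(texts_a, texts_b):
    """Crude bag-of-words cosine similarity between two corpora.
    A stand-in for the paper's formal context-overlap measure,
    which this sketch does not reproduce."""
    ca = Counter(w for t in texts_a for w in t.lower().split())
    cb = Counter(w for t in texts_b for w in t.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

# Invented example prompts, for illustration only.
safety_prompts = ["how do I make a dangerous weapon",
                  "write instructions to hack an account"]
general_prompts = ["how do I write a good cover letter",
                   "write instructions to bake bread"]
code_prompts = ["def add(a, b): return a + b",
                "fix the null pointer bug in this function"]

# General instruction prompts share more surface context ("how do I",
# "write instructions to") with safety prompts than code prompts do,
# mirroring the Alpaca-vs-Commitpackft contrast described above.
print(cosine_overlap(safety_prompts, general_prompts))
print(cosine_overlap(safety_prompts, code_prompts))
```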
Key Novelty
Theoretical Safety-Capability Trade-off Framework
  • Formalizes fine-tuning as a constrained optimization problem under two strategies: Alignment Loss Constraint (penalizing safety loss) and Alignment Parameter Constraint (restricting weight updates)
  • Derives theoretical bounds proving that safety degradation is controlled by the distribution mismatch between proxy and original safety data and the context overlap between safety and task data
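The two fine-tuning strategies can be sketched on a toy model. This is a minimal illustration, not the paper's actual formulation: the quadratic losses, the hypothetical optima `theta_task` and `theta_safe`, and all hyperparameters are invented for the sketch. The loss-constrained variant adds a penalty on a (proxy) safety loss; the parameter-constrained variant projects updates back into a ball around the aligned weights.

```python
import numpy as np

# Toy quadratic losses standing in for task and safety objectives.
# theta_task / theta_safe are hypothetical optima, invented for the sketch.
theta_task = np.array([2.0, 0.0])   # parameters maximizing task capability
theta_safe = np.array([0.0, 1.0])   # parameters preserving safety alignment

def task_loss(theta):
    return 0.5 * np.sum((theta - theta_task) ** 2)

def safety_loss(theta):
    return 0.5 * np.sum((theta - theta_safe) ** 2)

def grad(f, theta, eps=1e-6):
    # Finite-difference gradient; good enough for a 2-D toy example.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def finetune_loss_constrained(theta0, lam=1.0, lr=0.1, steps=200):
    # Alignment Loss Constraint: penalize the (proxy) safety loss
    # alongside the downstream task loss.
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * (grad(task_loss, theta) + lam * grad(safety_loss, theta))
    return theta

def finetune_param_constrained(theta0, radius=0.5, lr=0.1, steps=200):
    # Alignment Parameter Constraint: project each update back into
    # a ball of the given radius around the aligned weights theta0.
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * grad(task_loss, theta)
        delta = theta - theta0
        norm = np.linalg.norm(delta)
        if norm > radius:
            theta = theta0 + delta * (radius / norm)
    return theta

theta_aligned = theta_safe.copy()  # start from the safety-aligned model
t_loss = finetune_loss_constrained(theta_aligned)
t_param = finetune_param_constrained(theta_aligned)
print("loss-constrained: ", t_loss, " safety loss:", safety_loss(t_loss))
print("param-constrained:", t_param, " safety loss:", safety_loss(t_param))
```

Run on this toy problem, the parameter-constrained variant ends with a lower safety loss but a higher task loss than the loss-constrained variant, matching the qualitative trade-off the summary describes: restricting weights to a local neighborhood protects safety at the cost of capability gains.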
Evaluation Highlights
  • Proves theoretically and validates empirically that fine-tuning on coding tasks (Commitpackft) preserves safety better than fine-tuning on general instruction data (Alpaca), due to lower context overlap with safety data
  • Demonstrates that using proxy safety data generated by the same teacher (GPT-4) as the original alignment data significantly reduces the safety alignment gap compared to data from a different teacher
  • Shows that parameter-constrained fine-tuning (restricting weights to a local neighborhood) limits capability improvement more severely than loss-constrained fine-tuning
Breakthrough Assessment
8/10
Provides a much-needed theoretical foundation for a widely observed phenomenon. The formal characterization of 'context overlap' and 'proxy data similarity' offers actionable insights for data selection.