← Back to Paper List

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
Computer Science Department & Princeton Language and Intelligence, Princeton University, Institute for Interdisciplinary Information Sciences, Tsinghua University, ByteDance
Neural Information Processing Systems (2024)
RL Reasoning Benchmark

📝 Paper Summary

LLM Alignment Safety Fine-tuning
Fine-tuning aligned LLMs without safety prompts but testing them with safety prompts (a strategy called PTST) prevents the catastrophic loss of safety alignment that typically occurs when using identical templates.
Core Problem
Fine-tuning aligned LLMs on benign, utility-oriented datasets (like math or coding) often causes them to catastrophically forget safety alignment, leading them to answer harmful queries.
Why it matters:
  • Even benign model creators fine-tuning on safe data can inadvertently deploy unsafe models
  • Existing solutions focus on filtering training data or adding safety examples, but safety degradation persists even with clean data
  • The common practice of matching training and testing prompt templates exacerbates the problem by overfitting to the fine-tuning distribution
Concrete Example: When Llama 2-Chat is fine-tuned on the benign GSM8K math dataset using the standard safety prompt for both training and testing, its Attack Success Rate (ASR) on harmful queries jumps from 0% to 18.08%.
Key Novelty
Pure Tuning, Safe Testing (PTST)
  • Intentionally introduce a distribution shift between fine-tuning and inference prompt templates
  • Fine-tune the model on downstream tasks *without* safety prompts (Pure Tuning) to focus on utility
  • Deploy the model *with* a safety prompt (Safe Testing) to trigger the pre-aligned safety mechanisms, which remain intact because they weren't overwritten during fine-tuning
Architecture
Architecture Figure Figure 1 (implied)
Conceptual flow of the PTST strategy compared to standard fine-tuning.
Evaluation Highlights
  • Reduces Attack Success Rate (ASR) on DirectHarm4 from 18.08% (standard matching templates) to 1.08% (PTST) for Llama 2-Chat fine-tuned on GSM8K
  • Maintains downstream utility: PTST achieves 30.00% accuracy on GSM8K, comparable to the 33.51% of standard fine-tuning
  • Effective even when safety examples are added to training: reduces ASR from high levels to near zero on mixed benign/harmful datasets
Breakthrough Assessment
8/10
Simple, counter-intuitive, and highly effective strategy that challenges the standard practice of matching train/test distributions. Requires no extra data or complex training objectives.
×