Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

📝 Paper Summary

LLM Alignment Safety Fine-tuning

Fine-tuning an aligned LLM without a safety prompt and then deploying it with one (a strategy called PTST) largely prevents the loss of safety alignment that typically occurs when the same template is used for both stages.

Core Problem

Fine-tuning aligned LLMs on benign, utility-oriented datasets (like math or coding) often causes them to catastrophically forget safety alignment, leading them to answer harmful queries.

Why it matters:

Even benign model creators fine-tuning on safe data can inadvertently deploy unsafe models
Existing solutions focus on filtering training data or adding safety examples, but safety degradation persists even with clean data
The common practice of matching training and testing prompt templates exacerbates the problem by overfitting to the fine-tuning distribution

Concrete Example: When Llama 2-Chat is fine-tuned on the benign GSM8K math dataset using the standard safety prompt for both training and testing, its Attack Success Rate (ASR) on harmful queries jumps from 0% to 18.08%.

Key Novelty

Pure Tuning, Safe Testing (PTST)

Intentionally introduce a distribution shift between fine-tuning and inference prompt templates
Fine-tune the model on downstream tasks *without* safety prompts (Pure Tuning) to focus on utility
Deploy the model *with* a safety prompt (Safe Testing) to trigger the pre-aligned safety mechanisms, which remain intact because they weren't overwritten during fine-tuning
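
The three points above can be sketched as a small template plan. This is a minimal illustration, not the authors' code: `SAFETY_PROMPT`, `render`, `TemplatePlan`, and the two helper functions are hypothetical names, and the safety prompt text is a placeholder rather than the prompt of any real model.

```python
from dataclasses import dataclass

# Hypothetical placeholder; the real safety prompt depends on the base model.
SAFETY_PROMPT = "You are a helpful assistant. Refuse unsafe or harmful requests."


def render(query: str, with_safety_prompt: bool) -> str:
    """Wrap a user query, optionally prepending the safety prompt."""
    if with_safety_prompt:
        return f"{SAFETY_PROMPT}\n\n{query}"
    return query


@dataclass(frozen=True)
class TemplatePlan:
    """Records which template variant is used at each stage."""
    train_with_safety_prompt: bool
    test_with_safety_prompt: bool


# Standard practice: the same template (with safety prompt) at both stages.
MATCHED = TemplatePlan(train_with_safety_prompt=True, test_with_safety_prompt=True)
# PTST: no safety prompt during fine-tuning, safety prompt at inference.
PTST = TemplatePlan(train_with_safety_prompt=False, test_with_safety_prompt=True)


def training_input(plan: TemplatePlan, query: str) -> str:
    """Text the model sees for this query during fine-tuning."""
    return render(query, plan.train_with_safety_prompt)


def inference_input(plan: TemplatePlan, query: str) -> str:
    """Text the model sees for this query at deployment."""
    return render(query, plan.test_with_safety_prompt)


if __name__ == "__main__":
    q = "What is 12 * 7?"
    print(repr(training_input(PTST, q)))
    print(repr(inference_input(PTST, q)))
```

Under `PTST` the two stages deliberately see different text for the same query, whereas under `MATCHED` they see identical text.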

Architecture

Conceptual flow of the PTST strategy compared to standard fine-tuning.

Evaluation Highlights

Reduces Attack Success Rate (ASR) on DirectHarm4 from 18.08% (standard matching templates) to 1.08% (PTST) for Llama 2-Chat fine-tuned on GSM8K
Maintains downstream utility: PTST achieves 30.00% accuracy on GSM8K, comparable to the 33.51% of standard fine-tuning
Effective even when safety examples are added to training: reduces ASR from high levels to near zero on mixed benign/harmful datasets

Breakthrough Assessment

8/10

Simple, counter-intuitive, and highly effective strategy that challenges the standard practice of matching train/test distributions. Requires no extra data or complex training objectives.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning an aligned base model M on a benign dataset D_benign to improve utility while maintaining safety against harmful queries

Inputs: Benign instruction tuning data (e.g., math problems)

Outputs: Fine-tuned model weights that are helpful on D_benign but refuse harmful queries

Pipeline Flow

User Input -> [Template Application] -> [Fine-tuned Model] -> Response

System Modules

Template Application

Wraps user query in a specific string format (e.g., adding [INST] tags or system prompts)

Model or implementation: Deterministic string formatting

Fine-tuned Model

Generates response based on templated input

Model or implementation: Llama 2-Chat / Mistral 7B Instruct / GPT-3.5 Turbo
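
As a rough sketch of the two modules, the snippet below formats a single-turn query in the `[INST]` / `<<SYS>>` style commonly used for Llama 2-Chat and passes it to a model. The function names are hypothetical, `generate` is a stand-in for whatever inference call is actually used, and the beginning-of-sequence token is left to the tokenizer. Exact whitespace should be checked against the model's own chat template.

```python
from typing import Callable, Optional


def apply_llama2_chat_template(query: str, system_prompt: Optional[str] = None) -> str:
    """Format a single-turn query with [INST] tags and an optional <<SYS>> block."""
    if system_prompt:
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{query} [/INST]"
    return f"[INST] {query} [/INST]"


def respond(
    query: str,
    generate: Callable[[str], str],
    system_prompt: Optional[str] = None,
) -> str:
    """User Input -> Template Application -> Fine-tuned Model -> Response."""
    prompt = apply_llama2_chat_template(query, system_prompt)
    return generate(prompt)


if __name__ == "__main__":
    # Stub model that only reports the prompt length; replace with a real call.
    def stub_model(prompt: str) -> str:
        return f"(model saw {len(prompt)} characters)"

    print(respond("What is 12 * 7?", stub_model))
    print(respond("What is 12 * 7?", stub_model, system_prompt="Placeholder safety prompt."))
```

Because the template step is deterministic string formatting, switching between templates at training and inference time requires no change to the model itself.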

Novel Architectural Elements

Mismatched Prompt Template Strategy: Using distinct templates for training vs. inference to disentangle utility learning from safety mechanism preservation

Modeling

Base Model: Llama-2-7b-chat, Mistral-7B-Instruct-v0.2, GPT-3.5-turbo-0613

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize negative log-likelihood of the target response.

Formally: Standard Cross-Entropy Loss
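
Written out, this is the usual token-level negative log-likelihood. The notation is introduced here for illustration and is not taken from the paper: $q$ is a user query, $y$ its target response, $\theta$ the model parameters, and $T_{\text{train}}$ the prompt template applied during fine-tuning.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{(q,\,y) \sim D_{\text{benign}}}
    \left[ \sum_{t=1}^{|y|} \log p_\theta\!\left( y_t \mid T_{\text{train}}(q),\, y_{<t} \right) \right]
```

Under PTST, $T_{\text{train}}$ omits the safety prompt, while inference uses a different template $T_{\text{test}}$ that includes it.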

Adaptation: Full fine-tuning

Trainable Parameters: All parameters (for open weights models)

Training Data:

GSM8K (7.5k train examples)
ChatDoctor
OpenOrca

Key Hyperparameters:

learning_rate: 1e-4
epochs: 6
batch_size: Not explicitly reported in the paper (GPT-3.5 API auto-selects)
optimizer: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard SFT: PTST deliberately mismatches train/test templates to preserve safety alignment
vs. Safety Data Augmentation: PTST works without needing to curate or mix in safety data, though it is complementary to it
vs. Vacuum [not cited in paper]: Vacuum also explores prompt-mismatched fine-tuning, but for capability steering rather than safety preservation

Limitations

Depends on the base model having a safety prompt mechanism to begin with (works best on aligned models)
Slight degradation in downstream task performance compared to matched-template fine-tuning (e.g., about 3.5 percentage points on GSM8K, from 33.51% to 30.00%)
Does not defend against sophisticated optimization-based jailbreaks (like GCG) as effectively as it does against direct harmful queries
Tested primarily on a few specific chat models; generalization to other architectures has not been established

Reproducibility

Code: https://github.com/vfleaking/PTST

Code and the new DirectHarm4 dataset are publicly available. Llama 2 and Mistral are open-weight models. GPT-3.5 Turbo fine-tuning relies on the OpenAI API, which is a black box (some hyperparameters, such as batch size, are auto-selected).

📊 Experiments & Results

Evaluation Setup

Fine-tuning aligned models on benign tasks (Math, Medical, General Instruction) and testing for safety degradation.

Benchmarks:

DirectHarm4: safety evaluation, 400 harmful queries [New]
AdvBench: safety evaluation, 520 harmful behaviors
GSM8K: grade school math (utility)
ChatDoctor: medical advice (utility)
OpenOrca: general instruction following (utility)

Metrics:

Attack Success Rate (ASR): percentage of harmful queries that the model answers instead of refusing
Helpfulness (Accuracy/Exact Match on downstream task)
Statistical methodology: Fine-tuning repeated using three different seeds.
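
A minimal sketch of how ASR could be computed and averaged over seeds is shown below. It assumes each response has already been judged harmful or not; how that judgment is made is not specified in this summary, and the function names and toy data are hypothetical.

```python
from statistics import mean
from typing import Sequence


def attack_success_rate(is_harmful_response: Sequence[bool]) -> float:
    """Percentage of harmful queries that received a harmful, compliant answer."""
    if not is_harmful_response:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(is_harmful_response) / len(is_harmful_response)


def mean_asr_over_seeds(judgments_per_seed: Sequence[Sequence[bool]]) -> float:
    """Average ASR across independent fine-tuning runs (one list per seed)."""
    return mean([attack_success_rate(j) for j in judgments_per_seed])


if __name__ == "__main__":
    # Toy judgments for three seeds over four queries each (illustrative only).
    toy_runs = [
        [True, False, False, False],
        [True, True, False, False],
        [False, False, False, False],
    ]
    for seed, run in enumerate(toy_runs):
        print(f"seed {seed}: ASR = {attack_success_rate(run):.2f}%")
    print(f"mean ASR = {mean_asr_over_seeds(toy_runs):.2f}%")
```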

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (percentage points)
Llama-2-7b-chat fine-tuned on GSM8K: standard practice (same template) degrades safety; PTST preserves it.
DirectHarm4	ASR (%)	18.08	1.08	-17.00
GSM8K	Exact Match Accuracy (%)	33.51	30.00	-3.51
GPT-3.5 Turbo fine-tuning: the phenomenon generalizes.
DirectHarm4	ASR (%)	22.75	4.50	-18.25
With safety examples added (GSM8K + harmful-query mix):
Mix (GSM8K + Harmful)	ASR (%)	Not reported in the paper	Not reported in the paper	Not reported in the paper
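
The Δ column is the PTST value minus the baseline value, in percentage points. The short check below recomputes it from the figures reported in the table; the dictionary layout and the `delta` helper are illustrative only.

```python
# (model, metric) -> (baseline, PTST), all values in percent as reported above.
reported = {
    ("Llama-2-7b-chat", "DirectHarm4 ASR"): (18.08, 1.08),
    ("Llama-2-7b-chat", "GSM8K exact match"): (33.51, 30.00),
    ("GPT-3.5 Turbo", "DirectHarm4 ASR"): (22.75, 4.50),
}


def delta(baseline: float, ptst: float) -> float:
    """Change from baseline to PTST in percentage points, rounded to 2 places."""
    return round(ptst - baseline, 2)


if __name__ == "__main__":
    for (model, metric), (baseline, ptst) in reported.items():
        print(f"{model:16s} {metric:18s} {delta(baseline, ptst):+.2f}")
```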

Experiment Figures

Helpfulness (GSM8K accuracy) vs Safety (ASR) curves throughout the fine-tuning epochs.

ASR and Helpfulness for different combinations of training/testing templates, including mismatched safety prompts.

Main Takeaways

Using the same prompt template for fine-tuning and testing (standard practice) leads to significant safety degradation, even when the template includes a safety prompt.
PTST (fine-tuning without a safety prompt, testing with one) consistently achieves low ASR (<5%) while maintaining competitive downstream performance.
The safety degradation is not just 'forgetting'; fine-tuning with the safety prompt actually makes the model *less* safe than fine-tuning without it, likely because the model learns to associate the safety prompt with the new utility task rather than safety.
Adding safety examples helps, but PTST provides additional robustness, especially against creative harmful queries not covered by the safety examples.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Instruction Fine-tuning (SFT)
Familiarity with RLHF and safety alignment
Knowledge of prompt templates and system prompts

Key Terms

ASR: Attack Success Rate—the percentage of harmful queries for which the model provides a harmful, compliant response instead of a refusal

PTST: Pure Tuning, Safe Testing—the proposed strategy of fine-tuning without safety prompts but including them during inference

DirectHarm4: A new dataset curated by the authors containing 400 harmful queries across 4 categories that tend to elicit high ASRs

AdvBench: A standard benchmark for evaluating LLM safety, consisting of harmful instructions

GSM8K: A dataset of grade school math word problems, used here as a benign fine-tuning task

System Prompt: A special instruction usually prepended to the conversation history to guide the model's behavior (e.g., 'You are a helpful assistant')

Llama 2-Chat: A specific aligned version of the Llama 2 model family, tuned for dialogue and safety

GCG: Greedy Coordinate Gradient—an optimization-based jailbreak attack that finds adversarial suffixes to force models to answer harmful queries