Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models

📝 Paper Summary

Adversarial Attacks on LLMs, Red Teaming, Jailbreaking

DeGCG improves adversarial attack efficiency by decoupling suffix search into a behavior-agnostic first-token pre-search and a behavior-relevant content-aware post-search, leveraging transferability between these stages.

Core Problem

Gradient-based attacks such as GCG are computationally inefficient because of the vast discrete search space and uninformative random initialization, and the resulting suffixes transfer poorly across models and domains.

Why it matters:

LLMs remain vulnerable to jailbreaks despite safety alignment, posing significant risks if malicious users can automate attacks efficiently.
Existing methods struggle to transfer adversarial suffixes between models, requiring an expensive restart of the search process for every new target or model.
Optimizing the full target sequence simultaneously introduces noise, making the critical first step (bypassing refusal) difficult to achieve.

Concrete Example: When the query asks the model how to 'make a bomb', standard GCG initializes the suffix with random tokens and tries to optimize toward the entire target phrase 'Sure, here is how...' at once, often getting stuck. DeGCG first finds a suffix that simply elicits 'Sure' (an easier objective), then uses that suffix as the starting point for optimizing toward the full, behavior-specific response.

Key Novelty

DeGCG (Decoupled Greedy Coordinate Gradient) & i-DeGCG

Decouples the attack into two stages: First-Token Searching (FTS) to find a suffix that elicits a simple acknowledgment like 'Sure', and Content-Aware Searching (CAS) to fine-tune it for specific harmful content.
Treats the FTS suffix as a transferable 'pre-trained' initialization that places the search in a favorable area of the discrete token space for the harder CAS task.
Introduces i-DeGCG, an interleaved variant that iteratively alternates between FTS and CAS to continuously refine the suffix using self-transferability.

Architecture

Overview of the DeGCG framework, illustrating the two-stage process.

Evaluation Highlights

Achieves 43.9% Attack Success Rate (ASR) on Llama2-chat-7b (validation set), outperforming the GCG-M baseline (21.7%) by 22.2 percentage points.
Demonstrates strong cross-model transfer: transferring from Mistral-Instruct to Llama2-chat yields a 22.2 percentage-point ASR improvement on the validation set.
i-DeGCG variant achieves 90.6% ASR on OpenChat-3.5 test set, significantly outperforming standard GCG baselines.

Breakthrough Assessment

7/10

Significant improvement in attack efficiency and transferability compared to standard GCG. The two-stage decoupling is a clever, intuitive strategy that splits one hard optimization problem into two more tractable steps.

⚙️ Technical Details

Problem Definition

Setting: Adversarial suffix optimization for LLM jailbreaking

Inputs: Malicious query X (e.g., 'Tell me how to make a bomb')

Outputs: Adversarial suffix S that causes the model to generate target Y (e.g., 'Sure, here is how to make a bomb')
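
The setting above can be written as a loss over the suffix. The block below is a sketch that assumes the standard GCG negative log-likelihood objective (the summary does not spell out the exact loss or its normalization); the symbols p_θ for the target model and ⊕ for concatenation are notation introduced here. Under that assumption, the first-token objective is simply the t = 1 term of the full-target objective.

```latex
% Full-target (content-aware) objective: NLL of the whole target Y
\mathcal{L}_{\mathrm{CAS}}(S) = -\sum_{t=1}^{|Y|} \log p_\theta\bigl(y_t \mid X \oplus S,\; y_{<t}\bigr)

% First-token objective: only the first target token y_1 (e.g., 'Sure')
\mathcal{L}_{\mathrm{FTS}}(S) = -\log p_\theta\bigl(y_1 \mid X \oplus S\bigr)
```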

Pipeline Flow

Pre-Searching (First-Token Searching)
Post-Searching (Content-Aware Searching)
Evaluation/Generation
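
The two search stages can be illustrated on a toy problem. The sketch below is not the authors' implementation: it replaces the LLM loss with a hypothetical mismatch count against a hidden integer pattern, and it replaces gradient-guided GCG updates with plain random coordinate search. All names (`coordinate_search`, `make_toy_losses`, `degcg_toy`) are made up for illustration. It only shows the control flow: search on an easy partial objective first, then reuse the result as the initialization for the full objective.

```python
import random


def coordinate_search(suffix, loss_fn, vocab_size, steps, rng):
    """Random coordinate search (a gradient-free stand-in for GCG).

    Changes one position at a time and keeps the change only if the
    loss strictly improves.
    """
    best = list(suffix)
    best_loss = loss_fn(best)
    for _ in range(steps):
        cand = list(best)
        pos = rng.randrange(len(cand))
        cand[pos] = rng.randrange(vocab_size)
        cand_loss = loss_fn(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss


def make_toy_losses(target):
    """Toy losses (assumption): the 'pre-search' loss only looks at the
    first half of a hidden pattern, the 'post-search' loss at all of it."""
    half = len(target) // 2

    def fts_loss(s):
        return sum(a != b for a, b in zip(s[:half], target[:half]))

    def cas_loss(s):
        return sum(a != b for a, b in zip(s, target))

    return fts_loss, cas_loss


def degcg_toy(length=8, vocab_size=16, fts_steps=200, cas_steps=200, seed=0):
    rng = random.Random(seed)
    target = [rng.randrange(vocab_size) for _ in range(length)]
    fts_loss, cas_loss = make_toy_losses(target)
    init = [rng.randrange(vocab_size) for _ in range(length)]

    # Stage 1: pre-search on the easier partial objective.
    s_fts, _ = coordinate_search(init, fts_loss, vocab_size, fts_steps, rng)
    # Stage 2: post-search on the full objective, initialized from stage 1.
    _, final = coordinate_search(s_fts, cas_loss, vocab_size, cas_steps, rng)

    return {
        "init_cas_loss": cas_loss(init),
        "fts_cas_loss": cas_loss(s_fts),
        "final_cas_loss": final,
    }


result = degcg_toy()
print(result)
```

Because the toy search only accepts strict improvements, the full-objective loss after stage 1 can never be worse than at the random start, which mirrors the "favorable starting region" intuition without claiming anything about real model behavior.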

System Modules

First-Token Searcher (FTS)

Optimize suffix S to minimize loss on a generic target token (e.g., 'Sure')

Model or implementation: Target LLM (e.g., Llama-2-7b-chat)

Content-Aware Searcher (CAS)

Fine-tune S_FTS to minimize loss on the full specific target sequence

Model or implementation: Target LLM (Same or different from FTS)

Interleaved Controller (i-DeGCG only)

Alternates between FTS and CAS in a loop

Model or implementation: N/A (Algorithm logic)
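
The interleaved loop can be sketched as follows. This is a minimal illustration under stated assumptions: the switching rule (a fixed number of steps per phase for a fixed number of rounds) is an assumption, since the summary only says the two objectives alternate; the losses are toy mismatch counts rather than model losses; and `search` and `interleaved_search` are hypothetical names.

```python
import random


def search(suffix, loss_fn, vocab_size, steps, rng):
    """Random coordinate search that accepts only strict improvements."""
    best, best_loss = list(suffix), loss_fn(suffix)
    for _ in range(steps):
        cand = list(best)
        cand[rng.randrange(len(cand))] = rng.randrange(vocab_size)
        cand_loss = loss_fn(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best


def interleaved_search(init, fts_loss, cas_loss, vocab_size,
                       rounds, steps_per_phase, rng):
    """Alternate between the first-token and full-target objectives."""
    suffix = list(init)
    history = []  # full-target loss after each round
    for _ in range(rounds):
        suffix = search(suffix, fts_loss, vocab_size, steps_per_phase, rng)
        suffix = search(suffix, cas_loss, vocab_size, steps_per_phase, rng)
        history.append(cas_loss(suffix))
    return suffix, history


# Toy setup (assumption): a hidden integer pattern stands in for the target.
rng = random.Random(0)
target = [3, 1, 4, 1, 5, 9]
fts_loss = lambda s: int(s[0] != target[0])
cas_loss = lambda s: sum(a != b for a, b in zip(s, target))

suffix, history = interleaved_search(
    [0] * len(target), fts_loss, cas_loss,
    vocab_size=10, rounds=3, steps_per_phase=50, rng=rng,
)
print(history)
```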

Novel Architectural Elements

Two-stage decoupling of the discrete optimization process into behavior-agnostic pre-search and behavior-specific post-search
Interleaved meta-process (i-DeGCG) that cycles between objectives to refine the suffix

Modeling

Base Model: Llama2-chat-7b, Mistral-Instruct-7b, OpenChat-3.5-7b, Starling-LM-alpha-7b

Key Hyperparameters:

total_search_steps: 500
suffix_length: 20 (default), up to 100 for scaling experiments
top_k_candidates: Not explicitly reported in the paper (standard GCG usually uses 256)
batch_size: Not explicitly reported in the paper (standard GCG usually uses 512)
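
For readers who want to track these settings in code, one possible way to record them is sketched below. The class name `SearchConfig` is hypothetical and not taken from the released repository. Values the summary marks as unreported are left as `None` rather than filled in with the common GCG defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SearchConfig:
    """Hyperparameters as listed in this summary (hypothetical container)."""
    total_search_steps: int = 500
    suffix_length: int = 20                  # default; up to 100 in scaling runs
    top_k_candidates: Optional[int] = None   # not reported (GCG commonly uses 256)
    batch_size: Optional[int] = None         # not reported (GCG commonly uses 512)

    def __post_init__(self):
        if self.total_search_steps <= 0:
            raise ValueError("total_search_steps must be positive")
        if not 1 <= self.suffix_length <= 100:
            raise ValueError("suffix_length must be between 1 and 100")


default_cfg = SearchConfig()
scaling_cfg = SearchConfig(suffix_length=100)
print(default_cfg, scaling_cfg)
```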

Compute: Only 7B models were used because of memory constraints. GPU hours are not reported.

Comparison to Prior Work

vs. GCG-M: DeGCG decouples the search into two stages (FTS + CAS) instead of optimizing the full target from scratch.
vs. GCG-T: DeGCG uses transfer learning (FTS on source -> CAS on target) rather than joint optimization on multiple models.
vs. AutoDAN [not cited in paper]: DeGCG focuses on gradient-based suffix optimization rather than genetic algorithms or manual prompt engineering.

Limitations

Experiments limited to 7B parameter models due to memory constraints.
Performance on the 'Harmful' and 'Harassment/Bullying' categories is lower, possibly due to limited data size.
Larger search spaces (suffix length > 20) introduce complexity that standard baselines struggle with, though i-DeGCG mitigates this.

Reproducibility

Code: https://github.com/Waffle-Liu/DeGCG

Uses open-source models (Llama2, Mistral, OpenChat, Starling) and the HarmBench dataset. The evaluation classifier is a fine-tuned Llama2-13b provided by HarmBench.

📊 Experiments & Results

Evaluation Setup

Jailbreak attack success rate on aligned LLMs using harmful queries.

Benchmarks:

HarmBench (Safety evaluation / Jailbreaking)

Metrics:

Attack Success Rate (ASR)
Statistical methodology: Not explicitly reported in the paper
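
As a small worked example of the metric, the sketch below computes ASR from per-query classifier verdicts and the difference between two ASR values in percentage points. The function names are hypothetical, and the boolean labels are assumed to come from whatever classifier judges a response harmful.

```python
def attack_success_rate(labels):
    """ASR in percent. `labels` holds one bool per query:
    True if the response was judged harmful, False if it was a refusal."""
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one label")
    return 100.0 * sum(labels) / len(labels)


def delta_points(baseline_asr, new_asr):
    """Difference between two ASR values, in percentage points."""
    return round(new_asr - baseline_asr, 1)


print(attack_success_rate([True, False, True, True]))
print(delta_points(21.7, 43.9))
```

Note that a gain from 21.7% to 43.9% is 22.2 percentage points, which is roughly a doubling in relative terms; the two should not be conflated.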

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Self-transfer results using i-DeGCG show significant improvements over the GCG-M baseline on validation and test sets.
HarmBench (Llama2-chat-7b)	ASR (Valid)	21.7	43.9	+22.2
HarmBench (Llama2-chat-7b)	ASR (Test)	19.5	39.0	+19.5
Cross-model transfer results demonstrate that FTS on a source model provides effective initialization for CAS on a target model.
HarmBench (Mistral -> Llama2)	ASR (Valid)	21.7	43.9	+22.2
HarmBench (Starling -> OpenChat)	ASR (Valid)	82.5	91.5	+9.0
Cross-data transfer results show DeGCG improves performance on specific domains when initialized with generic FTS.
HarmBench (Chemical Biological)	ASR	10.0	20.0	+10.0

Experiment Figures

Loss landscape comparison between first-token optimization and full-sequence optimization.

ASR performance of DeGCG vs GCG-M across specific semantic domains (Cross-Data Transfer).

Main Takeaways

Optimizing the first token ('Sure') is the primary bottleneck; once bypassed, generating the rest of the harmful content is significantly easier.
Adversarial suffixes transfer well across models when used as initialization for further tuning, even between different tokenizers.
Interleaved training (i-DeGCG) effectively handles larger search spaces where static baselines fail, maintaining high ASR even as suffix length increases.

📚 Prerequisite Knowledge

Prerequisites

Gradient-based adversarial attacks (GCG)
Language Model alignment and jailbreaking
Transfer learning concepts (pre-training/fine-tuning)

Key Terms

GCG: Greedy Coordinate Gradient—an algorithm that uses gradients to identify promising token replacements for optimizing adversarial suffixes

ASR: Attack Success Rate—the percentage of malicious queries for which the model generates a harmful response instead of a refusal

FTS: First-Token Searching—optimizing a suffix solely to elicit a behavior-agnostic first token (e.g., 'Sure')

CAS: Content-Aware Searching—fine-tuning a suffix to elicit a specific behavior-relevant response (e.g., 'Sure, here is how to make a bomb')

Transferability: The ability of an adversarial suffix optimized on one model or task to work effectively on another

Cross-entropy loss: A loss function measuring the difference between the predicted probability distribution and the target distribution
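
The GCG entry above can be made concrete with a small sketch of one update step. This follows the commonly described GCG recipe (rank replacement tokens at each position by the gradient of the loss with respect to one-hot token indicators, sample a batch of single-token swaps, keep the candidate with the lowest exact loss). The gradient matrix and loss here are hand-written toy values, not model outputs, and the function names are hypothetical.

```python
import random


def top_k_replacements(grad, k):
    """grad[i][v] is the loss gradient for putting token v at position i.
    Returns, per position, the k token ids with the most negative gradient
    (the swaps a linear approximation predicts will lower the loss most)."""
    return [sorted(range(len(row)), key=lambda v: row[v])[:k] for row in grad]


def sample_candidates(suffix, top_k, batch_size, rng):
    """Each candidate differs from `suffix` in at most one position."""
    candidates = []
    for _ in range(batch_size):
        pos = rng.randrange(len(suffix))
        cand = list(suffix)
        cand[pos] = rng.choice(top_k[pos])
        candidates.append(cand)
    return candidates


def gcg_step(suffix, grad, loss_fn, k, batch_size, rng):
    """One step: propose swaps from the gradient, pick the best by exact loss."""
    top_k = top_k_replacements(grad, k)
    candidates = sample_candidates(suffix, top_k, batch_size, rng)
    return min(candidates, key=loss_fn)


# Toy values (assumption): 2 positions, vocabulary of 3 tokens.
toy_grad = [[0.5, -1.0, 0.2],
            [0.1, 0.3, -0.4]]
toy_loss = lambda s: sum(a != b for a, b in zip(s, [1, 2]))

new_suffix = gcg_step([0, 0], toy_grad, toy_loss, k=1, batch_size=8,
                      rng=random.Random(0))
print(new_suffix, toy_loss(new_suffix))
```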