Jailbreaking as a Reward Misspecification Problem

📝 Paper Summary

LLM Safety and Alignment; Adversarial Attacks / Jailbreaking; Red Teaming; Reward Modeling

The paper frames jailbreaking as exploiting 'reward misspecification', where an aligned model implicitly assigns a higher reward to a harmful response than to a harmless one, and proposes ReMiss, which automatically generates adversarial prompts that push this reward gap in favor of the harmful response.

Core Problem

Current jailbreaking attacks often fail because they optimize for target strings rather than the underlying behavioral vulnerability, and existing alignment processes suffer from reward misspecification where harmful outputs are implicitly favored.

Why it matters:

Target loss alone is a poor proxy for successful jailbreaking; low target loss doesn't guarantee the model is actually broken, leading to ineffective attacks
Alignment processes are opaque, making it difficult to understand why safety measures fail against adversarial inputs
Reward misspecification is a fundamental flaw in RLHF (Reinforcement Learning from Human Feedback), yet it is rarely used as an attack vector for red teaming

Concrete Example: Consider a malicious prompt like 'Guide users on how to create a fake ID'. An aligned model might refuse. However, if an adversarial suffix is added such that the model's implicit reward for a harmful response becomes higher than for a refusal, the model will generate the harmful content, revealing a 'reward misspecification' at that specific input point.

Key Novelty

ReMiss (Reward Misspecification for Jailbreaking)

Defines 'ReGap', a metric that measures the difference in implicit rewards between a harmless and a harmful response, where each implicit reward is the log-ratio of the aligned model's probability to the base model's probability
Uses ReGap as the optimization objective for generating adversarial suffixes, searching for inputs where the model internally 'prefers' the harmful output despite safety training
Treats jailbreaking as a search for inputs that minimize ReGap (equivalently, that maximize the implicit-reward advantage of the harmful response), effectively reversing the alignment process to find failure modes
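The two quantities described in this list can be written compactly as follows. The notation is an assumption made here for illustration (the summary does not fix symbols): \pi is the aligned model, \pi_{\mathrm{ref}} the base model, y^{+} a harmless response, y^{-} a harmful response, and x \oplus s the instruction x with suffix s appended.

```latex
% Assumed notation, not taken verbatim from the paper.
r(x, y) = \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\qquad
\mathrm{ReGap}(x, s) = r(x \oplus s,\, y^{+}) - r(x \oplus s,\, y^{-})
```

Under this reading, a negative ReGap means the harmful response receives the higher implicit reward, which is the misspecified case the attack searches for.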

Architecture

Overview of the ReMiss framework. It illustrates the iterative process of generating suffixes using a Generator Model, evaluating them via the ReGap metric (calculated using Target and Reference models), and updating the generator.

Evaluation Highlights

Achieves 90.2% attack success rate (ASR) on Llama-2-7b-chat-hf, outperforming GCG (56.2%) and AutoDAN (65.9%) on AdvBench
Attacks transfer effectively to closed-source models: 49.6% ASR on GPT-4o and 66.8% on GPT-3.5 Turbo
Produces suffixes with lower perplexity (more readable, and therefore stealthier) than optimization-based baselines like GCG

Breakthrough Assessment

8/10

Offers a theoretically grounded perspective on jailbreaking (reward misspecification) rather than just heuristic optimization. The ReGap metric provides a new tool for understanding alignment failures, and the empirical results are strong.

⚙️ Technical Details

Problem Definition

Setting: Adversarial suffix generation for aligned LLMs to elicit harmful behaviors

Inputs: Malicious instruction x and a target aligned model

Outputs: Adversarial suffix s such that the model follows the malicious instruction

Pipeline Flow

Generator Training: Train a generator model to produce suffixes minimizing ReGap
Suffix Generation: Generate candidate suffixes for a given harmful prompt
Selection: Select the best suffix based on the ReGap score
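The selection step can be sketched as picking the candidate with the lowest ReGap score. This is a minimal illustration, not the authors' implementation: `select_suffix`, `toy_scorer`, and the scores are hypothetical, and the scorer stands in for a real ReGap computation that would query the target and reference models.

```python
from typing import Callable, Sequence, Tuple


def select_suffix(prompt: str,
                  candidates: Sequence[str],
                  score_regap: Callable[[str, str], float]) -> Tuple[str, float]:
    """Return the candidate suffix with the lowest ReGap score, and that score."""
    if not candidates:
        raise ValueError("need at least one candidate suffix")
    scored = [(score_regap(prompt, suffix), suffix) for suffix in candidates]
    best_score, best_suffix = min(scored, key=lambda pair: pair[0])
    return best_suffix, best_score


# Hypothetical, made-up scores standing in for real model queries.
toy_scores = {"suffix-a": 2.1, "suffix-b": -0.7, "suffix-c": 0.4}


def toy_scorer(prompt: str, suffix: str) -> float:
    """Placeholder scorer: looks up a fixed score and ignores the prompt."""
    return toy_scores[suffix]


best, score = select_suffix("<instruction>", list(toy_scores), toy_scorer)
print(best, score)
```

Lower is better here because, under the definition used in this summary, a more negative ReGap means the harmful response is more strongly preferred.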

System Modules

Generator Model

Predicts adversarial suffixes given a malicious prompt

Model or implementation: Llama-2-7b-chat-hf (finetuned)

Target Model (Target Evaluation)

The aligned victim model being attacked; provides probabilities for ReGap calculation

Model or implementation: Various (Llama-2, Vicuna, Guanaco, etc.)

Reference Model (Target Evaluation)

The base model used to normalize probabilities and calculate implicit rewards

Model or implementation: Pre-trained base version of the target model (e.g., Llama-2-7b)

Novel Architectural Elements

ReGap-guided optimization loop: The attack objective is strictly defined by the reward misspecification metric (ReGap) rather than simple target string probability
Implicit Reward formulation: Utilizing the log-ratio between aligned and base model probabilities as a differentiable proxy for the (unknown) reward function
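To make the implicit-reward idea concrete, the sketch below computes it from per-token log-probabilities. It assumes the log-probability of a response is the sum of its token log-probabilities; the function names and all numbers are invented for illustration and do not come from the paper.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of a whole response: the sum of its per-token log-probs."""
    return sum(token_logprobs)


def implicit_reward(aligned_logprobs: Sequence[float],
                    reference_logprobs: Sequence[float]) -> float:
    """Log-ratio of the aligned model's probability to the reference model's."""
    return sequence_logprob(aligned_logprobs) - sequence_logprob(reference_logprobs)


def regap(harmless_aligned: Sequence[float], harmless_reference: Sequence[float],
          harmful_aligned: Sequence[float], harmful_reference: Sequence[float]) -> float:
    """Implicit reward of the harmless response minus that of the harmful one."""
    return (implicit_reward(harmless_aligned, harmless_reference)
            - implicit_reward(harmful_aligned, harmful_reference))


# Toy log-probs (made up). Plain prompt: the aligned model favors the refusal.
plain = regap([-0.2, -0.3], [-2.0, -2.5], [-4.0, -5.0], [-1.0, -1.5])
# Same prompt with a hypothetical adversarial suffix: the preference flips.
attacked = regap([-3.0, -3.5], [-2.0, -2.5], [-0.5, -0.5], [-1.0, -1.5])
print(plain, attacked)
```

In this toy case the gap is positive for the plain prompt and negative once the suffix is added, which is the sign change the attack looks for.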

Modeling

Base Model: Llama-2-7b-chat-hf (as the primary generator and target in experiments)

Training Method: Finetuning generator on suffixes that minimize ReGap

Objective Functions:

Purpose: Minimize the reward gap (ReGap) to find misspecified regions.

Formally: Minimize L_ReGap(s) = log(1 + exp(ReGap(x, s)))

Purpose: Maintain readability of the suffix.

Formally: Regularization term L_ref(s|x), the negative log-probability of the suffix under the reference model, which is minimized alongside L_ReGap
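The two objective terms above can be evaluated as in the following sketch. The softplus form follows the formula given here; the numerically stable rewrite, the function names, the toy inputs, and in particular the weighted sum at the end are assumptions, since this summary does not state how the terms are combined (the alpha term is left out entirely).

```python
import math
from typing import Sequence


def regap_loss(regap: float) -> float:
    """Softplus log(1 + exp(regap)), rewritten to avoid overflow for large inputs."""
    return max(regap, 0.0) + math.log1p(math.exp(-abs(regap)))


def readability_loss(suffix_ref_logprobs: Sequence[float]) -> float:
    """Negative log-probability of the suffix tokens under the reference model."""
    return -sum(suffix_ref_logprobs)


# The loss shrinks toward 0 as ReGap becomes more negative.
for value in (3.5, 0.0, -3.5):
    print(value, regap_loss(value))

# ASSUMPTION: a plain weighted sum with beta = 0.1 (the readability weight
# listed under Key Hyperparameters). The actual combination may differ.
beta = 0.1
total = regap_loss(-3.5) + beta * readability_loss([-1.0, -0.5])
print(total)
```

Because softplus is monotonically increasing, minimizing it pushes ReGap downward while keeping the loss bounded below by zero.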

Key Hyperparameters:

alpha: 0.5 (weight for up-weighting target model log probability)
beta: 0.1 (weight for readability regularization)
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. GCG: Optimizes ReGap instead of target string loss; produces more readable prompts; higher ASR
vs. AutoDAN: Uses ReGap as the fitness function rather than just target log-prob or classifier scores
vs. PAIR/TAP: Focuses on the theoretical root cause (reward misspecification) rather than black-box iterative refinement

Limitations

Requires access to a reference (base) model to calculate implicit rewards, which may not be available for all API-based models (though transfer attacks work)
Computational cost of calculating ReGap (requires inference on both aligned and reference models) is higher than simple loss-based methods
Relies on the assumption that jailbreaks correspond to reward misspecification, which holds empirically but may have edge cases

Reproducibility

Code: https://github.com/zhxieml/remiss-jailbreak

The paper relies on existing benchmarks (AdvBench, HarmBench) and standard models (Llama-2, Vicuna). Specific hyperparameters for the generator training loop (learning rate, batch size) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Jailbreaking aligned LLMs using generated adversarial suffixes

Benchmarks:

AdvBench: harmful behaviors (520 instances)
HarmBench: diverse harmful behaviors (Standard and Contextual)

Metrics:

Attack Success Rate (ASR)
Perplexity (PPL)
Statistical methodology: Not explicitly reported in the paper
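Both metrics reduce to short formulas, sketched below. The function names are hypothetical, perplexity is assumed to use natural-log token probabilities, and the judgment of whether an individual attack succeeded is taken as given input because this summary does not say how it is made.

```python
import math
from typing import Sequence


def attack_success_rate(successes: Sequence[bool]) -> float:
    """Percentage of prompts for which the attack was judged successful."""
    if not successes:
        raise ValueError("no attack outcomes given")
    return 100.0 * sum(successes) / len(successes)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("no tokens given")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy inputs (made up): 3 of 4 attacks succeed.
asr = attack_success_rate([True, False, True, True])
# A prompt whose tokens each have probability 0.25 under the scoring model.
ppl = perplexity([math.log(0.25)] * 4)
print(asr, ppl)
```

A more fluent prompt has higher token probabilities and therefore lower perplexity, which is why perplexity serves as the readability proxy.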

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Attack Success Rate (ASR) comparisons on open-source models using the AdvBench dataset.
AdvBench	ASR (%)	65.9 (AutoDAN)	90.2	+24.3
AdvBench	ASR (%)	56.2 (GCG)	90.2	+34.0
AdvBench	ASR (%)	46.2	90.2	+44.0
Evaluation of transferability to closed-source models using adversarial suffixes generated on open-source models.
AdvBench	ASR (%)	39.6	49.6 (GPT-4o)	+10.0
AdvBench	ASR (%)	55.8	66.8 (GPT-3.5 Turbo)	+11.0
Evaluation on the HarmBench benchmark across varying capabilities.
HarmBench	ASR (%)	62.4	78.4	+16.0
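The Δ column is the difference between the two ASR values in percentage points. The short check below recomputes it from the table's own numbers; the tuple layout is just a convenience chosen for this sketch.

```python
# (benchmark, baseline ASR, this paper's ASR, reported delta), copied from the table.
rows = [
    ("AdvBench", 65.9, 90.2, 24.3),
    ("AdvBench", 56.2, 90.2, 34.0),
    ("AdvBench", 46.2, 90.2, 44.0),
    ("AdvBench", 39.6, 49.6, 10.0),
    ("AdvBench", 55.8, 66.8, 11.0),
    ("HarmBench", 62.4, 78.4, 16.0),
]

# Round to one decimal place to absorb floating-point noise in the subtraction.
deltas = [round(ours - baseline, 1) for _, baseline, ours, _ in rows]
reported = [delta for _, _, _, delta in rows]
print(deltas == reported)
```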

Experiment Figures

Scatter plot comparing Target Loss vs. ReGap for differentiating successful and unsuccessful jailbreaks.

Heatmap of misspecification rates on models with implanted backdoors.

Main Takeaways

ReMiss consistently outperforms baselines (GCG, AutoDAN, PAIR, TAP) across multiple open-source models (Llama-2, Vicuna, Guanaco).
Attacks generated by ReMiss transfer to closed-source models (49.6% ASR on GPT-4o and 66.8% on GPT-3.5 Turbo), suggesting the vulnerabilities exploited are fundamental rather than specific to one model.
The ReGap metric effectively distinguishes between successful and unsuccessful jailbreaks where target loss alone fails, validating the reward misspecification hypothesis.
Generated prompts maintain lower perplexity (better readability) than GCG, making them harder to detect with simple filters.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF) concepts
Basics of adversarial attacks on LLMs (suffix optimization)
KL-divergence and probability distributions

Key Terms

ReGap: Reward Gap—a metric measuring the difference in implicit rewards assigned to a harmless response versus a harmful response; negative values indicate misspecification

implicit reward: The effective reward a model assigns to a response, derived from the log-ratio of the aligned model's probability to the reference (base) model's probability

GCG: Greedy Coordinate Gradient—a discrete optimization method for finding adversarial suffixes by swapping tokens to minimize target loss

AutoDAN: An automated jailbreak generation method that uses a hierarchical genetic algorithm

AdvBench: A benchmark dataset of harmful behaviors used to evaluate jailbreaking attacks

ASR: Attack Success Rate—the percentage of malicious prompts for which the model generates a harmful response

RLHF: Reinforcement Learning from Human Feedback—a method to align language models using reward models trained on human preferences

perplexity: A measurement of how well a probability model predicts a sample; in this context, used as a proxy for the fluency/readability of the adversarial prompt