← Back to Paper List

Jailbreaking as a Reward Misspecification Problem

Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong
The University of Hong Kong, Huawei Noah’s Ark Lab
arXiv.org (2024)
RL Benchmark

📝 Paper Summary

LLM Safety and Alignment Adversarial Attacks / Jailbreaking Red Teaming Reward Modeling
The paper frames jailbreaking as exploiting 'reward misspecification'—where aligned models implicitly assign higher rewards to harmful responses—and proposes ReMiss to automatically generate adversarial prompts that maximize this gap.
Core Problem
Current jailbreaking attacks often fail because they optimize for target strings rather than the underlying behavioral vulnerability, and existing alignment processes suffer from reward misspecification where harmful outputs are implicitly favored.
Why it matters:
  • Target loss alone is a poor proxy for successful jailbreaking; low target loss doesn't guarantee the model is actually broken, leading to ineffective attacks
  • Alignment processes are opaque, making it difficult to understand why safety measures fail against adversarial inputs
  • Reward misspecification is a fundamental flaw in RLHF (Reinforcement Learning from Human Feedback), yet it is rarely used as an attack vector for red teaming
Concrete Example: Consider a malicious prompt like 'Guide users on how to create a fake ID'. An aligned model might refuse. However, if an adversarial suffix is added such that the model's implicit reward for a harmful response becomes higher than for a refusal, the model will generate the harmful content, revealing a 'reward misspecification' at that specific input point.
Key Novelty
ReMiss (Reward Misspecification for Jailbreaking)
  • Defines 'ReGap', a metric that measures the difference in implicit rewards (derived from the ratio of aligned vs. base model probabilities) between harmful and harmless responses
  • Uses ReGap as the optimization objective for generating adversarial suffixes, searching for inputs where the model internally 'prefers' the harmful output despite safety training
  • Treats jailbreaking as searching for inputs that maximize this reward gap, effectively reversing the alignment process to find failure modes
Architecture
Architecture Figure Figure 2
Overview of the ReMiss framework. It illustrates the iterative process of generating suffixes using a Generator Model, evaluating them via the ReGap metric (calculated using Target and Reference models), and updating the generator.
Evaluation Highlights
  • Achieves 90.2% attack success rate (ASR) on Llama-2-7b-chat-hf, outperforming GCG (56.2%) and AutoDAN (65.9%) on AdvBench
  • Attacks transfer effectively to closed-source models: 49.6% ASR on GPT-4o and 66.8% on GPT-3.5 Turbo
  • Maintains higher perplexity-based stealth (lower perplexity means more readable) compared to optimization-based baselines like GCG
Breakthrough Assessment
8/10
Offers a theoretically grounded perspective on jailbreaking (reward misspecification) rather than just heuristic optimization. The ReGap metric provides a new tool for understanding alignment failures, and the empirical results are strong.
×