
Voice Jailbreak Attacks Against GPT-4o

Xinyue Shen, Yixin Wu, Michael Backes, Yang Zhang
arXiv (2024)
MM Speech Benchmark

📝 Paper Summary

Adversarial Attacks on MLLMs · AI Safety and Alignment
VoiceJailbreak bypasses GPT-4o's voice safeguards by wrapping forbidden questions in fictional storytelling contexts, significantly outperforming direct queries and converted text jailbreaks.
Core Problem
GPT-4o's voice mode resists standard text jailbreak prompts converted to audio, largely because they are too long for voice input or contain pauses that cause the model to respond (or refuse) before the full prompt has been spoken.
Why it matters:
  • The rapid adoption of multimodal assistants like GPT-4o introduces new attack surfaces (voice) that are not fully understood
  • Existing text-based jailbreaks do not transfer effectively to voice mode due to duration constraints and different processing mechanisms (e.g., pause detection)
  • Because directly transferred text attacks fail, the voice modality can appear more robust than it actually is, creating a false sense of security
Concrete Example: When a text jailbreak prompt starts with 'Let's play a game,' the natural pause after the sentence causes GPT-4o to interrupt and respond immediately, missing the subsequent forbidden question hidden in the audio.
Key Novelty
VoiceJailbreak (Fictional Storytelling Attack)
  • Humanizes the MLLM interaction by framing the attack as a fictional story using three key elements: Setting (e.g., a game world), Character (e.g., a hacker), and Plot (the forbidden question)
  • Utilizes advanced literary techniques like Point of View (POV), Red Herrings, and Foreshadowing to further disguise malicious intent within the narrative flow
Architecture
Figure 3: The construction flow of the VoiceJailbreak attack
Evaluation Highlights
  • Increases average Attack Success Rate (ASR) on GPT-4o from 0.033 (baseline) to 0.778 across six forbidden scenarios
  • Demonstrates that text jailbreak prompts transferred to audio are ineffective, achieving ASRs consistently below 0.100
  • Adding the 'Foreshadowing' technique raises ASR in the Pornography scenario from 0.400 to 0.600
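The ASR figures above are fractions of attempts judged successful. As a minimal sketch of how such a metric is typically computed, assuming each attempt is labeled with a single success/failure flag and that the cross-scenario average weights every scenario equally (the paper's exact labeling and averaging protocol is not given in this summary), the arithmetic looks like this. The function names, scenario names, and labels below are hypothetical toy values, not results from the paper.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Fraction of attempts labeled successful (True)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def asr_by_scenario(records):
    """Group (scenario, success) pairs and compute ASR per scenario."""
    grouped = defaultdict(list)
    for scenario, ok in records:
        grouped[scenario].append(bool(ok))
    return {name: attack_success_rate(flags) for name, flags in grouped.items()}


def macro_average(per_scenario):
    """Unweighted mean of per-scenario ASRs (an assumed averaging rule)."""
    return sum(per_scenario.values()) / len(per_scenario)


# Hypothetical toy labels, purely illustrative (not data from the paper).
records = [
    ("scenario_a", True), ("scenario_a", False),
    ("scenario_a", True), ("scenario_a", True),
    ("scenario_b", False), ("scenario_b", False),
    ("scenario_b", True), ("scenario_b", False),
]

per_scenario = asr_by_scenario(records)
average_asr = macro_average(per_scenario)
print(per_scenario, round(average_asr, 3))
```

Under this assumed rule, scenarios with different numbers of attempts contribute equally to the average; pooling all attempts into a single ratio would instead weight each scenario by its attempt count.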
Breakthrough Assessment
8/10
First systematic measurement of jailbreak attacks on GPT-4o's voice mode, exposing a critical vulnerability in multimodal safeguards that text-only defenses miss.