Voice Jailbreak Attacks Against GPT-4o

📝 Paper Summary

Adversarial Attacks on MLLMs | AI Safety and Alignment

VoiceJailbreak bypasses GPT-4o's voice safeguards by wrapping forbidden questions in fictional storytelling contexts, raising the average attack success rate from 0.033 to 0.778 and outperforming both direct queries and text jailbreaks converted to audio.

Core Problem

GPT-4o's voice mode is resistant to standard text jailbreak prompts because they are often too long or contain pauses that trigger premature model responses/refusals.

Why it matters:

The rapid adoption of multimodal assistants like GPT-4o introduces new attack surfaces (voice) that are not fully understood
Existing text-based jailbreaks do not transfer effectively to voice mode due to duration constraints and different processing mechanisms (e.g., pause detection)
Directly transferring attacks fails, creating a false sense of security regarding the robustness of audio modalities

Concrete Example: When a text jailbreak prompt starts with 'Let's play a game,' the natural pause after the sentence causes GPT-4o to interrupt and respond immediately, missing the subsequent forbidden question hidden in the audio.

Key Novelty

VoiceJailbreak (Fictional Storytelling Attack)

Humanizes the MLLM interaction by framing the attack as a fictional story using three key elements: Setting (e.g., a game world), Character (e.g., a hacker), and Plot (the forbidden question)
Utilizes advanced literary techniques like Point of View (POV), Red Herrings, and Foreshadowing to further disguise malicious intent within the narrative flow

Architecture

The construction flow of the VoiceJailbreak attack

Evaluation Highlights

Increases average Attack Success Rate (ASR) on GPT-4o from 0.033 (baseline) to 0.778 across six forbidden scenarios
Demonstrates that text jailbreak prompts transferred to audio are ineffective, achieving ASRs consistently below 0.100
Using the 'Foreshadowing' technique specifically boosts ASR in the Pornography scenario from 0.400 to 0.600

Breakthrough Assessment

8/10

First systematic measurement of jailbreak attacks on GPT-4o's voice mode, exposing a critical vulnerability in multimodal safeguards that text-only defenses miss.

⚙️ Technical Details

Problem Definition

Setting: Black-box adversarial attack against the voice mode of a Multimodal Large Language Model (MLLM)

Inputs: Forbidden question q converted into a voice prompt via Text-to-Speech (TTS)

Outputs: Voice response from the target MLLM

Pipeline Flow

Jailbreak Construction (Adversary)
Audio Generation (TTS)
Attack Execution (Device Playback)

System Modules

Jailbreak Constructor

Wraps forbidden question in fictional elements (Setting, Character, Plot)

Model or implementation: Manual Construction / Template

Audio Generator

Converts text prompt to natural-sounding audio

Model or implementation: OpenAI TTS (model: tts-1, voice: Fable)

Target Model

Receives audio input and generates response

Model or implementation: GPT-4o (via ChatGPT App)

Novel Architectural Elements

Application of literary theory (Setting, Character, Plot) specifically to construct audio-modality adversarial prompts

Modeling

Base Model: GPT-4o (accessed via ChatGPT Plus subscription)

Training Method: Inference-time adversarial attack (Prompt Engineering)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Text Jailbreaks: VoiceJailbreak is optimized for audio duration and speech patterns, whereas text jailbreaks fail due to length (avg 171s) and pause-triggered interruptions
vs. Visual Jailbreaks: Targets the audio modality's specific safeguards and processing characteristics rather than visual inputs
vs. GCG (Greedy Coordinate Gradient) [not cited in paper]: GCG optimizes discrete tokens for text attacks; VoiceJailbreak uses semantic-level storytelling optimized for natural speech flow

Limitations

Experiments conducted manually using two registered accounts, limiting scale
Relies on TTS voices (Fable/Nova/Onyx) rather than diverse human voices
Attack success depends on black-box access to the ChatGPT app, which may change
Manual evaluation of ASR introduces potential subjectivity

Reproducibility

Code: https://github.com/TrustAIRLab/VoiceJailbreakAttack

Code and data are available at the repository above. Experiments rely on manual playback of TTS audio to a phone running the ChatGPT app. Evaluation involves manual labeling by the authors.

📊 Experiments & Results

Evaluation Setup

Black-box attack on GPT-4o voice mode using TTS-generated audio prompts played from a laptop to a phone

Benchmarks:

ForbiddenQuestionSet (Answering questions violating OpenAI usage policy)

Metrics:

Attack Success Rate (ASR)
Utility (Duration, Word Count, Readability)
Statistical methodology: Not explicitly reported in the paper
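As a rough illustration of the ASR metric, the sketch below computes a per-scenario success rate from manually assigned labels and then averages across scenarios. The record format, the scenario names, and the choice of an unweighted (macro) average are assumptions made for this example; the summary does not state how the paper aggregates its scores.

```python
from collections import defaultdict


def attack_success_rate(labels):
    """Fraction of attempts labeled successful (True)."""
    if not labels:
        raise ValueError("need at least one labeled attempt")
    return sum(labels) / len(labels)


def asr_by_scenario(records):
    """Group (scenario, success) pairs and return {scenario: ASR}."""
    grouped = defaultdict(list)
    for scenario, success in records:
        grouped[scenario].append(bool(success))
    return {name: attack_success_rate(v) for name, v in grouped.items()}


def average_asr(per_scenario):
    """Unweighted mean over scenarios (an assumed aggregation)."""
    return sum(per_scenario.values()) / len(per_scenario)


# Hypothetical labels for illustration only, not data from the paper.
records = [
    ("scenario_a", True), ("scenario_a", False),
    ("scenario_b", False), ("scenario_b", False),
]
per_scenario = asr_by_scenario(records)
print(per_scenario)
print(average_asr(per_scenario))
```

Because the labels are assigned by hand, the resulting rates inherit whatever subjectivity the labeling carries, which is the concern raised under Limitations.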

Key Results

VoiceJailbreak significantly outperforms baselines (direct questions and text jailbreaks) in bypassing GPT-4o's voice safeguards.
Benchmark	Metric	Baseline	This Paper	Δ
ForbiddenQuestionSet (Average)	ASR	0.033	0.778	+0.745
ForbiddenQuestionSet (Illegal Activity)	ASR	0.000	See Main Takeaways	Not reported in the paper
ForbiddenQuestionSet (All Scenarios)	ASR	0.033	0.100	+0.067
ForbiddenQuestionSet (Pornography, with Foreshadowing)	ASR	0.400	0.600	+0.200
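The Δ column reports absolute differences in ASR. A minimal sketch of that arithmetic follows, alongside the relative change for comparison. The function names are hypothetical; the input values are the ones listed in the table.

```python
def absolute_gain(baseline, attack):
    """Difference in ASR, rounded to the table's three decimals."""
    return round(attack - baseline, 3)


def relative_gain(baseline, attack):
    """Change as a fraction of the baseline; undefined when baseline is 0."""
    if baseline == 0:
        return None
    return round((attack - baseline) / baseline, 3)


# (baseline ASR, reported ASR) pairs taken from the table rows.
rows = {
    "average": (0.033, 0.778),
    "pornography": (0.400, 0.600),
}
for name, (baseline, attack) in rows.items():
    print(name, absolute_gain(baseline, attack), relative_gain(baseline, attack))
```

The distinction matters for the Pornography row: moving from 0.400 to 0.600 is an absolute gain of 0.200 (20 percentage points) but a relative gain of 50%.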

Experiment Figures

Case studies of failed attacks vs. successful VoiceJailbreak

Main Takeaways

GPT-4o's voice mode has strong internal safeguards against direct forbidden questions and transferred text jailbreak prompts (ASR < 0.100)
Text jailbreak prompts are generally too long (avg 171 seconds) for voice interactions and contain pauses that trigger premature model responses
Fictional storytelling (VoiceJailbreak) effectively bypasses voice safeguards by establishing a harmless context (Setting/Character/Plot), raising average ASR to 0.778
Advanced literary techniques like Foreshadowing can further enhance attack success in specific scenarios (e.g., +0.200 ASR in Pornography, from 0.400 to 0.600)
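To give a sense of scale for the 171-second average, the sketch below converts between word count and spoken duration. The speaking rate of 150 words per minute is an assumption chosen for illustration; it is not a figure from the paper, and the actual pace of a TTS voice may differ.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def estimated_duration_seconds(text, words_per_minute=150.0):
    """Approximate spoken length of text at an assumed speaking rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count(text) * 60.0 / words_per_minute


def words_for_duration(seconds, words_per_minute=150.0):
    """Approximate number of words spoken in the given time."""
    return seconds * words_per_minute / 60.0


# At the assumed rate, a 171-second prompt is roughly this many words.
print(words_for_duration(171))
```

Under this assumption a 171-second prompt runs to several hundred words, far longer than a typical spoken turn, which is consistent with the observation that such prompts are poorly suited to voice interaction.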

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) jailbreaking concepts
Familiarity with Multimodal AI (audio/text integration)
Basic knowledge of prompt engineering

Key Terms

MLLM: Multimodal Large Language Model—an AI model capable of processing and generating multiple types of media (text, audio, images)

Jailbreak: An adversarial attack designed to bypass an AI model's safety filters to elicit forbidden or harmful responses

ASR: Attack Success Rate—the proportion of adversarial attempts that successfully induce the model to provide a harmful response

TTS: Text-to-Speech—technology that converts written text into spoken audio

POV: Point of View—a literary technique (first-person vs. third-person) used here to distance the model from the harmful act

Red Herring: A misleading clue or distraction used to divert the model's attention from the true malicious intent of the prompt

Foreshadowing: A literary device where hints are given about future events, used here to prime the model for a forbidden question