Evaluation Setup
Black-box attack on GPT-4o voice mode using TTS-generated audio prompts played from a laptop to a phone
Benchmarks:
- ForbiddenQuestionSet (questions whose answers would violate OpenAI's usage policy)
Metrics:
- Attack Success Rate (ASR)
- Utility (Duration, Word Count, Readability)
- Statistical methodology: Not explicitly reported in the paper
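
As a minimal sketch of how the ASR and word-count metrics listed above could be computed, assume each trial has already been labeled as a success or a failure. The `Trial` record and the helper names are hypothetical and do not come from the paper, which does not describe its scoring code here.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One attack attempt (hypothetical record, not the paper's schema)."""
    scenario: str   # forbidden scenario name, e.g. "Illegal Activity"
    success: bool   # True if the response was judged to answer the question
    response: str   # transcript of the model's reply


def attack_success_rate(trials):
    """ASR = number of successful attempts / total attempts."""
    if not trials:
        raise ValueError("ASR is undefined for an empty list of trials")
    return sum(t.success for t in trials) / len(trials)


def asr_by_scenario(trials):
    """Group trials by scenario and compute the ASR of each group."""
    groups = {}
    for t in trials:
        groups.setdefault(t.scenario, []).append(t)
    return {name: attack_success_rate(group) for name, group in groups.items()}


def word_count(text):
    """Whitespace-delimited word count, a simple utility metric."""
    return len(text.split())
```

Averaging the per-scenario values from `asr_by_scenario` gives a macro-average across scenarios, which can differ from the pooled ASR when scenarios have different numbers of trials.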
Key Results
VoiceJailbreak significantly outperforms baselines (direct questions and text jailbreaks) in bypassing GPT-4o's voice safeguards.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ForbiddenQuestionSet (Average) | ASR | 0.033 | 0.778 | +0.745 |
| ForbiddenQuestionSet (Illegal Activity) | ASR | 0.000 | See Main Takeaways | Not reported in the paper |
| ForbiddenQuestionSet (All Scenarios) | ASR | 0.033 | 0.100 | +0.067 |
| ForbiddenQuestionSet (Pornography) | ASR | 0.400 | 0.600 | +0.200 |
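
The Δ values reported above are absolute differences in ASR (this paper minus baseline). The short check below recomputes them from the reported figures; the variable and function names are illustrative, and the Illegal Activity row is omitted because no numeric value is given for it.

```python
# Reported (baseline ASR, this-paper ASR) pairs from the results above.
rows = [
    ("ForbiddenQuestionSet (Average)", 0.033, 0.778),
    ("ForbiddenQuestionSet (All Scenarios)", 0.033, 0.100),
    ("ForbiddenQuestionSet (Pornography)", 0.400, 0.600),
]


def asr_delta(baseline, attack):
    """Absolute ASR change, rounded to the three decimals used in the table."""
    return round(attack - baseline, 3)


deltas = {name: asr_delta(baseline, attack) for name, baseline, attack in rows}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```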
Main Takeaways
- GPT-4o's voice mode has strong internal safeguards against direct forbidden questions and transferred text jailbreak prompts (ASR < 0.100)
- Text jailbreak prompts are generally too long (avg 171 seconds) for voice interactions and contain pauses that trigger premature model responses
- Fictional storytelling (VoiceJailbreak) effectively bypasses voice safeguards by establishing a harmless context (Setting/Character/Plot), raising average ASR to 0.778
- Advanced literary techniques such as foreshadowing can further raise attack success in specific scenarios (e.g., +0.200 ASR, or 20 percentage points, in Pornography)
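
To get a feel for the prompt-length point above, spoken duration can be roughly estimated from word count. This is only a sketch: the default rate of 150 words per minute is an assumption made for illustration, not a figure from the paper, and real TTS output also includes pauses that this estimate ignores.

```python
def estimated_speech_seconds(text, words_per_minute=150.0):
    """Rough spoken duration of `text` at an assumed constant speaking rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return len(text.split()) * 60.0 / words_per_minute
```

Under this assumed rate, the reported 171-second average would correspond to a prompt of roughly 430 words, which illustrates why long text jailbreak prompts are awkward to deliver in a voice conversation.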