โ† Back to Paper List

(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs

Eugene Bagdasarian, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov
arXiv.org (2023)
MM Speech Benchmark

๐Ÿ“ Paper Summary

Adversarial Machine Learning LLM Security
Adversarial perturbations blended into images and audio can inject hidden instructions into multi-modal LLMs, enabling indirect prompt injection attacks where the model executes commands unbeknownst to the user.
Core Problem
Multi-modal LLMs accepting images and audio are vulnerable to indirect prompt injection via these modalities, allowing attackers to hide instructions in media that humans perceive as benign.
Why it matters:
  • Expands the attack surface beyond text; malicious instructions can be hidden in audio or images where users cannot detect them visually or aurally
  • Enables attacks on isolated systems by leveraging unwitting human users as vectors (e.g., a user manually uploading a 'cursed' image found online)
  • Current defenses for text injection do not account for adversarial perturbations in continuous signal modalities like pixel or audio data
Concrete Example: An attacker blends a hidden text prompt into an audio recording. When a user asks PandaGPT 'describe this sound', the model instead outputs the attacker's target string or follows a hidden instruction (e.g., steering the conversation) without the user realizing the audio contained speech.
Key Novelty
Multi-Modal Indirect Prompt Injection via Adversarial Perturbations
  • Apply adversarial example generation (gradient-based optimization) to input images/audio to force the model to generate a specific text sequence (the injected prompt)
  • Leverage the auto-regressive nature of dialog systems: once the model outputs the injected prompt (e.g., as a caption), that prompt enters the conversation history and steers future interactions (Dialog Poisoning)
Architecture
Architecture Figure Figure 4
The adversarial generation process using teacher-forcing
Evaluation Highlights
  • Demonstrated successful targeted-output attacks against LLaVA (image inputs) and PandaGPT (image and audio inputs) using adversarial perturbations
  • Achieved 'Dialog Poisoning' where an injected instruction (e.g., in an image) successfully steered the model's behavior in subsequent conversation turns
  • Visual/Auditory content preservation: The adversarial perturbations do not significantly destroy the semantic content, allowing the model to still converse about the image while following the hidden instruction
Breakthrough Assessment
8/10
Significantly expands the threat model for LLMs by demonstrating that non-text modalities can be used as vectors for indirect injection, bypassing text-only filters.
×