(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs

📝 Paper Summary

Adversarial Machine Learning LLM Security

Adversarial perturbations blended into images and audio can inject hidden instructions into multi-modal LLMs, enabling indirect prompt injection attacks where the model executes commands unbeknownst to the user.

Core Problem

Multi-modal LLMs accepting images and audio are vulnerable to indirect prompt injection via these modalities, allowing attackers to hide instructions in media that humans perceive as benign.

Why it matters:

Expands the attack surface beyond text; malicious instructions can be hidden in audio or images where users cannot detect them visually or aurally
Enables attacks on isolated systems by leveraging unwitting human users as vectors (e.g., a user manually uploading a 'cursed' image found online)
Current defenses for text injection do not account for adversarial perturbations in continuous signal modalities like pixel or audio data

Concrete Example: An attacker blends a hidden text prompt into an audio recording. When a user asks PandaGPT 'describe this sound', the model instead outputs the attacker's target string or follows a hidden instruction (e.g., steering the conversation) without the user realizing the audio contained speech.

Key Novelty

Multi-Modal Indirect Prompt Injection via Adversarial Perturbations

Apply adversarial example generation (gradient-based optimization) to input images/audio to force the model to generate a specific text sequence (the injected prompt)
Leverage the auto-regressive nature of dialog systems: once the model outputs the injected prompt (e.g., as a caption), that prompt enters the conversation history and steers future interactions (Dialog Poisoning)

Architecture

The adversarial generation process using teacher-forcing

Evaluation Highlights

Demonstrated successful targeted-output attacks against LLaVA (image inputs) and PandaGPT (image and audio inputs) using adversarial perturbations
Achieved 'Dialog Poisoning' where an injected instruction (e.g., in an image) successfully steered the model's behavior in subsequent conversation turns
Visual/Auditory content preservation: The adversarial perturbations do not significantly destroy the semantic content, allowing the model to still converse about the image while following the hidden instruction

Breakthrough Assessment

8/10

Significantly expands the threat model for LLMs by demonstrating that non-text modalities can be used as vectors for indirect injection, bypassing text-only filters.

⚙️ Technical Details

Problem Definition

Setting: Targeted adversarial attack on multi-modal sequence-to-sequence models

Inputs: Benign image/audio x^I, Target prompt/instruction w

Outputs: Perturbed input x^{I,w} that causes the model to output w

Pipeline Flow

Input Processing (Image/Audio)
Multi-modal Encoding (CLIP/ImageBind)
Projection (Linear Layer)
Language Modeling (Vicuna/LLaMA)

System Modules

Adversarial Input

The compromised image or audio file containing the gradient-optimized perturbation

Model or implementation: Perturbed Image x*

Vision/Audio Encoder

Encodes the raw modality into a feature vector

Model or implementation: CLIP ViT-L/14 (LLaVA) or ImageBind (PandaGPT)

LLM Decoder

Generates text response based on the encoded multi-modal input and text prompt

Model or implementation: Vicuna (based on LLaMA-7B)

Novel Architectural Elements

Application of teacher-forcing optimization directly to the input modality (pixels/audio) to force specific text token generation in a multi-modal decoder

Modeling

Base Model: LLaVA-7B and PandaGPT-7B (both based on Vicuna/LLaMA)

Training Method: Adversarial optimization of the input (not model training)

Objective Functions:

Purpose: Force model to output target string y*.

Formally: Minimize CrossEntropy(Model(x + delta), y*)

Key Hyperparameters:

learning_rate_LLaVA: 0.01 (min 1e-4)
epochs_LLaVA: 100
learning_rate_PandaGPT: 0.005 (min 1e-5)
+ 2 more
epochs_PandaGPT: 500
temperature: 0.7

Compute: Single NVIDIA Quadro RTX 6000 24GB GPU

Comparison to Prior Work

vs. Standard Indirect Prompt Injection: Uses non-text modalities (images/audio) allowing stealthier injection
vs. Visual Jailbreaking: Targets the user-victim interaction (indirect injection) rather than just bypassing guardrails for the attacker
vs. Adversarial Collisions: Optimizes for specific output tokens via decoder teacher-forcing rather than matching embeddings, overcoming the modality gap

Limitations

Attack success depends on the model's ability to maintain context and follow instructions (Dialog Poisoning)
Requires white-box access to the model gradients to generate the perturbation
Perturbations must be optimized for specific target strings; universal perturbations not demonstrated
The injected instruction is visible in the model's first response (not fully stealthy output)

Reproducibility

No specific code repository provided for the attack scripts. LLaVA and PandaGPT are open-source models. Audio samples available via YouTube link.

📊 Experiments & Results

Evaluation Setup

Proof-of-concept demonstration on open-source Multi-modal LLMs

Benchmarks:

Custom Image/Audio inputs (Adversarial Instruction Injection) [New]

Metrics:

Qualitative success (did the model output the target text?)
Dialog adherence (did the model follow the injected instruction in future turns?)
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

The Threat Model: Attacker injects prompt into image -> User receives image -> User queries Chatbot -> Chatbot is steered by injection

Dialog Poisoning example using the 'Crying Boy' image

Main Takeaways

Adversarial perturbations can successfully force multi-modal LLMs (LLaVA, PandaGPT) to output arbitrary strings chosen by the attacker.
Dialog Poisoning is effective: once the model outputs the injected instruction (e.g., as a caption), it treats it as context and follows the instruction in subsequent turns.
Attacks work across modalities: demonstrated on both image-to-text (LLaVA, PandaGPT) and audio-to-text (PandaGPT).
Simple embedding collisions failed due to the modality gap; gradient-based token optimization (teacher forcing) was necessary for success.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multi-modal LLM architectures (Vision Encoders + LLM Decoders)
Adversarial Example generation (Fast Gradient Sign Method)
Prompt Injection concepts

Key Terms

Indirect Prompt Injection: An attack where an LLM is manipulated by instructions hidden within data (like a webpage or image) it processes, rather than a direct user command

Adversarial Perturbation: Small, carefully calculated noise added to data (pixels or audio waves) that confuses a machine learning model but is imperceptible to humans

Auto-regressive: A property of language models where the output is generated one token at a time, and each output becomes part of the input for the next step

Teacher-forcing: A training technique used here for attack generation, where the model is fed the ground-truth target tokens as history to calculate gradients for the input perturbation

Dialog Poisoning: An attack where a malicious instruction is injected into the conversation history (context), causing the model to follow that instruction in future interactions

FGSM: Fast Gradient Sign Method—a standard algorithm for generating adversarial examples by adjusting input data in the direction of the error gradient

Modality Gap: The phenomenon where embeddings of different modalities (e.g., image vs. text) occupy different regions of the vector space, making direct collision attacks difficult