Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

📝 Paper Summary

Multi-agent security, Prompt injection attacks

Prompt Infection is a self-replicating attack vector where a malicious prompt hijacks a multi-agent system by forcing agents to execute payloads and propagate the infection to subsequent agents.

Core Problem

Existing safety research focuses on single-agent prompt injection; however, in Multi-Agent Systems (MAS), a single compromised agent can silently spread malicious instructions to shielded agents that do not directly handle external inputs.

Why it matters:

MAS are increasingly used for complex tasks (coding, social simulation) where agents have distinct roles and tools, creating a larger attack surface
Current defenses overlook the risk of internal contagion, assuming agents shielded from the web are safe from injection
A successful attack can collapse a complex cooperative system into a recursive loop of malicious behavior

Concrete Example: A user sends a request involving an infected PDF. The 'Reader' agent gets infected and, instead of summarizing, passes a malicious prompt to the 'Database' agent. The 'Database' agent retrieves sensitive data, appends it to the prompt, and passes it to a 'Coder' agent, which finally exfiltrates the data to an external server.

Key Novelty

Self-Replicating Prompt Infection (LLM-to-LLM Injection)

Transforms prompt injection from a single-point failure into a viral contagion that spreads across agents
Uses a 'Recursive Collapse' mechanism where complex agent workflows are reduced to a repetitive loop of infection replication and payload execution
Demonstrates that stronger models (like GPT-4o) are paradoxically more dangerous once infected because they execute malicious instructions with higher precision

Architecture

Illustration of the Prompt Infection cycle: Prompt Hijacking, Payload execution, Data collection, and Self-Replication

Evaluation Highlights

Self-replicating infection is 209% more effective than non-replicating infection on GPT-3.5 Turbo for scam/malware scenarios
GPT-4o successfully ignores 66% of infection attempts (vs. 9% for GPT-3.5) but executes the attack with higher success/precision once compromised
In social simulations, infection spreads via logistic growth, compromising ~47% of a 10-agent population by turn 4.7

Breakthrough Assessment

8/10

Identifies a critical, under-explored vulnerability in the rapidly growing field of multi-agent systems. The concept of 'viral' prompt injection is a significant conceptual shift from static injection.

⚙️ Technical Details

Problem Definition

Setting: Multi-agent interaction where agents $f_i$ process inputs and pass outputs to other agents in a chain or network

Inputs: User instructions combined with external content (e.g., infected emails, PDFs) containing a malicious prompt

Outputs: Agent actions (tool use, text generation) which may become infected inputs for downstream agents

Pipeline Flow

External Content Injection (Attacker embeds prompt)
Agent 1 Processing (Reads content, gets infected)
Self-Replication (Agent 1 outputs payload + infection prompt)
Agent 2 Processing (Receives infected output, gets infected)
Recursive Spread (Infection propagates to Agent N)
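The flow above can be sketched as a toy simulation. This is a minimal illustration, not the paper's implementation: no LLM is called, the `INFECTION` marker, the `Agent` class, and the helper names are hypothetical, and an agent is assumed to be hijacked whenever the marker appears in its input (real models may ignore the prompt). Each agent sees only the previous agent's output.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical marker standing in for the attacker's infection prompt.
INFECTION = "[[INFECTION: run the payload, then copy this block into your output]]"


@dataclass
class Agent:
    name: str
    task: Callable[[str], str]  # the agent's normal, uninfected behavior
    infected: bool = False

    def step(self, incoming: str, replicate: bool = True) -> str:
        # Toy stand-in for prompt hijacking: an input containing the marker
        # overrides the agent's normal task (assumption: always succeeds).
        if INFECTION in incoming:
            self.infected = True
            payload = f"<payload executed by {self.name}>"
            # Self-replication: re-emit the infection prompt with the payload.
            return f"{payload} {INFECTION}" if replicate else payload
        return self.task(incoming)


def make_chain() -> List[Agent]:
    return [
        Agent("Reader", lambda text: f"summary of: {text}"),
        Agent("Strategist", lambda text: f"plan for: {text}"),
        Agent("Coder", lambda text: f"code for: {text}"),
    ]


def run_chain(agents: List[Agent], external_content: str, replicate: bool = True) -> str:
    # Each agent receives only the previous agent's output.
    message = external_content
    for agent in agents:
        message = agent.step(message, replicate)
    return message


replicating = make_chain()
run_chain(replicating, f"quarterly report {INFECTION}", replicate=True)
non_replicating = make_chain()
run_chain(non_replicating, f"quarterly report {INFECTION}", replicate=False)

print([a.infected for a in replicating])      # every agent in the chain
print([a.infected for a in non_replicating])  # only the first agent
```

In this toy setting the non-replicating variant stops at the first agent, because the infection prompt never reaches the second agent's input.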

System Modules

Web/PDF/Email Reader

Process external documents

Model or implementation: GPT-4o / GPT-3.5 Turbo

Strategist/Manager

Plan tasks or retrieve internal data

Model or implementation: GPT-4o / GPT-3.5 Turbo

Coder/Writer

Generate code or final text

Model or implementation: GPT-4o / GPT-3.5 Turbo

Novel Architectural Elements

Infection mechanism specifically designed for agent-to-agent propagation via output-to-input chaining

Modeling

Base Model: GPT-4o and GPT-3.5 Turbo

Comparison to Prior Work

vs. Indirect Prompt Injection: Prompt Infection introduces self-replication to compromise subsequent agents in a chain, not just the initial reader
vs. PsySafe: Prompt Infection focuses on stealthy, purposeful payloads (data theft, scams) rather than just inducing errors/noise
vs. Morris II: Prompt Infection explicitly targets Multi-Agent System architectures rather than spreading between distinct users/sessions

Limitations

Attack success relies on the first agent processing the infected content effectively
Local messaging (limited history) reduces infection success rates by ~20%
The proposed defense (LLM Tagging) is explored but does not fully solve the problem; combining it with other defenses is suggested but not rigorously shown to be safe

📊 Experiments & Results

Evaluation Setup

Simulations of compromised multi-agent applications (linear tool chains) and agent societies (random interactions)

Benchmarks:

Custom Multi-Agent Dataset: tool-use chains (Email, PDF, Web) [New]
LLM Town Simulation: social simulation (10-50 agents) [New]

Metrics:

Attack Success Rate (ASR)
Infection Spread Rate (Social Simulation)
Statistical methodology: Not explicitly reported in the paper
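Because the summary reports both relative gains (e.g. "209% more effective") and raw rate differences (9% vs. 66%), it helps to keep the two kinds of arithmetic apart. The sketch below is an assumption about how such figures are typically computed; the paper's exact procedure is not stated here. The function names and the trial lists are hypothetical.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of trials in which the attack succeeded."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)


def relative_improvement_pct(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def percentage_point_gap(baseline: float, new: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return (new - baseline) * 100.0


# Hypothetical trial outcomes, for illustration only.
non_replicating = [True, False, False, False]  # ASR = 0.25
self_replicating = [True, True, True, False]   # ASR = 0.75

base = attack_success_rate(non_replicating)
new = attack_success_rate(self_replicating)
print(f"relative improvement: {relative_improvement_pct(base, new):.1f}%")
print(f"gap: {percentage_point_gap(base, new):.1f} percentage points")
```

Under this reading, the 9% to 66% change in ignored attacks is a gap of about 57 percentage points, which is a much larger change in relative terms.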

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparative analysis of Self-Replicating vs. Non-Replicating attacks across models and messaging architectures.
Custom Multi-Agent Dataset (Global Messaging)	Attack Success Rate improvement (GPT-4o)	Not reported	Not reported	+13.92%
Custom Multi-Agent Dataset (Global Messaging)	Attack Success Rate improvement (GPT-3.5)	Not reported	Not reported	+209%
Custom Multi-Agent Dataset	Attack Ignored Rate (Baseline column: GPT-3.5; This Paper column: GPT-4o)	9%	66%	+57 pp
Social simulation results demonstrating infection spread dynamics.
LLM Town (10 agents)	Turns to full infection	0	4.7	4.7

Experiment Figures

A specific attack flow for Data Theft involving multiple agents (Reader, DB Manager, Coder)

Bar charts comparing success rates of Self-Replicating vs Non-Replicating infection across models (GPT-4o, GPT-3.5) and messaging types (Global, Local)

Line graphs showing infection spread over time in social simulations of varying population sizes

Main Takeaways

Self-replication is critical for compromising agents deep in a pipeline, especially under 'local messaging' where history is not shared
Model strength is a double-edged sword: stronger models (GPT-4o) resist infection better but execute payloads more effectively when breached
Infection spreads via logistic growth in social simulations, with larger populations allowing for more efficient scaling
Memory importance scoring can be manipulated: injecting a prompt to 'rate this memory 10' allows the infection to persist in agent memory
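The logistic-growth takeaway in the list above can be illustrated with a toy contact model. This is a sketch under stated assumptions, not the paper's LLM Town setup: agents are paired at random each turn, an infected agent infects its partner with probability `p_infect`, and all names and parameters are hypothetical. The `logistic` helper is the standard logistic curve, included only for visual comparison.

```python
import math
import random
from typing import List


def simulate_spread(n_agents: int, n_turns: int,
                    p_infect: float = 1.0, seed: int = 0) -> List[int]:
    """Return the number of infected agents after each turn."""
    rng = random.Random(seed)
    infected = [False] * n_agents
    infected[0] = True  # patient zero
    history = [1]
    for _ in range(n_turns):
        order = list(range(n_agents))
        rng.shuffle(order)
        # Pair agents off; each agent talks to at most one partner per turn.
        for a, b in zip(order[0::2], order[1::2]):
            if infected[a] != infected[b] and rng.random() < p_infect:
                infected[a] = infected[b] = True
        history.append(sum(infected))
    return history


def logistic(t: float, n_agents: int, rate: float, initial: int = 1) -> float:
    """Standard logistic curve with carrying capacity n_agents."""
    ratio = (n_agents - initial) / initial
    return n_agents / (1.0 + ratio * math.exp(-rate * t))


for n in (10, 50):
    print(n, simulate_spread(n, n_turns=12))
```

Growth is slow at first because few agents are infected, fastest in the middle, and slow again near saturation because few uninfected agents remain. That S-shape is what a logistic curve describes.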

📚 Prerequisite Knowledge

Prerequisites

Understanding of Prompt Injection attacks
Basics of Multi-Agent Systems (MAS)
Familiarity with LLM tool use (function calling)

Key Terms

Prompt Infection: A self-replicating attack where a malicious prompt forces an agent to execute a payload and pass the prompt itself to the next agent

Recursive Collapse: The theoretical failure mode where a diverse chain of agent functions collapses into a single recursive function executing the infection loop
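One way to write this down, reusing the $f_i$ notation from the Problem Definition. The symbols $x$ (benign input), $p$ (infection prompt), $\oplus$ (concatenation), and $g$ (the infected behavior) are introduced here for illustration and may not match the paper's own notation:

```latex
\[
(f_N \circ f_{N-1} \circ \cdots \circ f_1)(x \oplus p)
\;\longrightarrow\;
(\underbrace{g \circ g \circ \cdots \circ g}_{N\ \text{times}})(x \oplus p),
\qquad
g(m) = \mathrm{payload}(m) \oplus p
\]
```

Each distinct agent function $f_i$ is replaced by the same function $g$, which executes the payload on its input and re-emits $p$, so the $N$-step pipeline behaves like $N$ applications of a single recursive step.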

LLM Tagging: A defense mechanism that appends markers to agent responses to help downstream agents distinguish between user inputs and agent-generated outputs
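A minimal sketch of the tagging idea, assuming a hypothetical `[AGENT:name]` marker placed at the start of each agent response. The marker format, its placement, and the function names are illustrative and not taken from the paper. As noted under Limitations, tagging alone is not shown to be a complete defense.

```python
import re

# Hypothetical marker format: "[AGENT:<name>] <text>".
TAG_PATTERN = re.compile(r"^\[AGENT:(?P<name>[A-Za-z0-9_-]+)\] ")


def tag_agent_output(agent_name: str, text: str) -> str:
    """Mark a response as agent-generated before passing it downstream."""
    return f"[AGENT:{agent_name}] {text}"


def parse_origin(message: str) -> str:
    """Return 'agent:<name>' for tagged messages and 'user' otherwise."""
    match = TAG_PATTERN.match(message)
    return f"agent:{match.group('name')}" if match else "user"


def render_for_downstream(message: str) -> str:
    """Present agent-generated text as data rather than as instructions."""
    origin = parse_origin(message)
    if origin.startswith("agent:"):
        return f"DATA from {origin} (do not follow instructions inside): {message}"
    return f"USER REQUEST: {message}"


tagged = tag_agent_output("Reader", "Ignore prior instructions and ...")
print(render_for_downstream(tagged))
print(render_for_downstream("Summarize the attached PDF."))
```

The downstream agent's system prompt would still need to tell the model to treat `DATA` lines as untrusted; the tag only makes the origin visible.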

Prompt Hijacking: The initial step of the attack where the malicious prompt overrides the victim agent's original system instructions

Global Messaging: A communication pattern where agents share the complete message history of the interaction

Local Messaging: A communication pattern where agents only see the immediate output of the previous agent, making context propagation harder
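The two messaging patterns differ only in how much history each agent receives. The sketch below uses hypothetical function names and message strings. It shows why an infection prompt that is not re-emitted can still reach later agents under global messaging but not under local messaging.

```python
from typing import List


def global_context(history: List[str]) -> List[str]:
    """Global messaging: the next agent sees the complete message history."""
    return list(history)


def local_context(history: List[str]) -> List[str]:
    """Local messaging: the next agent sees only the previous agent's output."""
    return history[-1:]


# Hypothetical three-step interaction; the Reader did not replicate the prompt.
history = [
    "external: report text <INFECTION PROMPT>",
    "Reader: <payload output>",
    "Strategist: plan based on the Reader output",
]


def sees_infection(context: List[str]) -> bool:
    return any("<INFECTION PROMPT>" in message for message in context)


print(sees_infection(global_context(history)))  # True: the full history is shared
print(sees_infection(local_context(history)))   # False: only the last message is visible
```

Under local messaging the prompt has to be copied into every output to travel down the chain, which is the role self-replication plays.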