From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows

📝 Paper Summary

AI Agent Security, LLM Vulnerabilities, Protocol Security

This survey introduces a unified threat model for LLM-agent ecosystems, cataloging over thirty attack techniques spanning input manipulation, model compromise, system and privacy attacks, and emerging protocol-level vulnerabilities.

Core Problem

The rapid integration of LLMs with plugins and inter-agent protocols (like MCP) has outpaced security practices, creating brittle integrations with weak validation that are vulnerable to attacks ranging from prompt injections to protocol exploits.

Why it matters:

Current workflows rely on ad-hoc authentication and inconsistent schemas, making them prone to exploitation
Prior research is fragmented, focusing on isolated exploits rather than the full communication stack including new protocols like MCP
High success rates of existing attacks (e.g., >90% for jailbreaks) threaten the reliability of autonomous agent deployments

Concrete Example: In a Prompt-to-SQL (P2SQL) injection, a malicious user crafts a natural language query that bypasses validation and causes the agent to execute unauthorized database commands. Another example is the 'Toxic Agent Flow' exploit in a GitHub MCP server, which corrupts the agent's context.
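To make the P2SQL case concrete, here is a minimal sketch of one possible mitigation: executing model-generated SQL on a connection that only permits reads. It is not taken from the paper. The helper names `read_only_authorizer` and `run_generated_sql` and the `users` table are hypothetical; `sqlite3.Connection.set_authorizer` is the standard-library hook used to deny everything except plain `SELECT` statements.

```python
import sqlite3


def read_only_authorizer(action, arg1, arg2, db_name, trigger_name):
    # Allow only SELECT statements and the column reads they perform.
    # Everything else (INSERT, DELETE, DROP, PRAGMA, SQL functions, ...) is denied.
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


def run_generated_sql(conn, sql):
    """Run SQL produced by the model; return rows, or None if it was rejected."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        # Denied statements fail when SQLite compiles them, before any change is made.
        return None


# Hypothetical demo database. The authorizer is installed only after setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.commit()
conn.set_authorizer(read_only_authorizer)

benign = run_generated_sql(conn, "SELECT name FROM users")
injected = run_generated_sql(conn, "DROP TABLE users")
```

In this sketch the benign query returns its rows while the injected `DROP TABLE` is rejected and leaves the table intact. The allowlist is deliberately strict (it also blocks SQL functions such as `count`), so a real deployment would need to widen it with care.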

Key Novelty

Unified End-to-End Threat Model for LLM-Agents

Bridges the gap between input-level exploits (prompts) and protocol-layer vulnerabilities (MCP, A2A) in a single taxonomy
Provides formal mathematical formulations for threat models across four categories: Input Manipulation, Model Compromise, System & Privacy, and Protocol Vulnerabilities
Analyzes specific risks in emerging standards like the Model Context Protocol (MCP) which were previously underexplored

Architecture

Organization of the survey paper, mapping the four main threat categories to specific sections.

Evaluation Highlights

Cataloged over 30 distinct attack techniques verified against real-world incidents and vulnerability databases (CVE, NIST NVD)
Highlighted attack success rates from literature: >90% for sophisticated jailbreaks (e.g., GPTFuzz) and near-perfect success for backdoor implants like DemonAgent
Identified critical vulnerabilities in the host-to-tool and agent-to-agent layers of the Model Context Protocol (MCP)

Breakthrough Assessment

8/10

Comprehensive synthesis of the fragmented agent security landscape. The formal definitions and the specific focus on new protocols like MCP distinguish it from general LLM security surveys.

⚙️ Technical Details

Problem Definition

Setting: LLM-powered autonomous agent ecosystems involving host-to-tool and agent-to-agent communications

Inputs: User prompts, external tool outputs, inter-agent messages

Outputs: Agent actions, tool invocations, synthesized responses

Pipeline Flow

Input Layer (User/Adversary Prompts)
Model Layer (LLM Inference/Reasoning)
Tool/Protocol Layer (MCP/API Calls)
System/Output Layer (Action Execution)
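The four pipeline layers above can be paired with the survey's four threat categories. The sketch below encodes that pairing as a small lookup structure. The one-to-one mapping is an assumption drawn from this summary's wording (input, model, system, and protocol threats), not a table from the paper, and the names `Layer`, `PIPELINE`, and `THREAT_CATEGORY_BY_LAYER` are hypothetical.

```python
from enum import Enum


class Layer(Enum):
    INPUT = "Input Layer (User/Adversary Prompts)"
    MODEL = "Model Layer (LLM Inference/Reasoning)"
    PROTOCOL = "Tool/Protocol Layer (MCP/API Calls)"
    SYSTEM = "System/Output Layer (Action Execution)"


# Order in which a request flows through the agent pipeline.
PIPELINE = [Layer.INPUT, Layer.MODEL, Layer.PROTOCOL, Layer.SYSTEM]

# Assumed pairing of each layer with one threat category from the taxonomy.
THREAT_CATEGORY_BY_LAYER = {
    Layer.INPUT: "Input Manipulation",
    Layer.MODEL: "Model Compromise",
    Layer.PROTOCOL: "Protocol Vulnerabilities",
    Layer.SYSTEM: "System & Privacy",
}


def describe_pipeline():
    """Return (layer description, threat category) pairs in pipeline order."""
    return [(layer.value, THREAT_CATEGORY_BY_LAYER[layer]) for layer in PIPELINE]


for layer_name, category in describe_pipeline():
    print(f"{layer_name}: {category}")
```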

System Modules

Input Mechanism

Receives natural language instructions or multimodal inputs

Model or implementation: Various LLMs (e.g., GPT-4, Llama)

Core LLM Agent

Processes inputs, maintains context, and decides on tool invocations

Model or implementation: Foundation Model (e.g., GPT-4)

Protocol Interface

Manages communication with external tools via standards like MCP

Model or implementation: Protocol Adapters (MCP, ANP)

Novel Architectural Elements

Unified threat modeling framework integrating protocol layers (MCP, A2A) with traditional LLM input/model layers

Comparison to Prior Work

vs. Yang et al.: Focuses specifically on *threats* and *attacks* within protocols rather than only classifying the protocols themselves
vs. Hou et al.: Integrates MCP risks into a broader ecosystem threat model including input and model layers
vs. Wang et al. [Full-stack safety]: Provides a granular taxonomy of >30 specific attack techniques rather than a high-level safety roadmap
Novel contribution: First integrated taxonomy bridging input-level exploits and protocol-layer vulnerabilities with formal definitions

Limitations

Taxonomy relies on reported literature; rapid evolution of attacks may render specific examples obsolete quickly
Formal mathematical definitions are theoretical and require empirical validation in diverse real-world deployments
Defense mechanisms are outlined but not experimentally evaluated for effectiveness within the paper

Reproducibility

Survey paper; synthesizes existing work and proposes theoretical frameworks. No specific code repository for a new system is provided, but references to open vulnerability databases (CVE) and existing attack tools (GPTFuzz) are included.

📊 Experiments & Results

Evaluation Setup

Literature review and taxonomy construction based on >150 publications and real-world vulnerability databases

Benchmarks:

Literature Review (Survey / Taxonomy Construction) [New]

Metrics:

Attack Success Rate (ASR) reported in cited works
Statistical methodology: Not explicitly reported in the paper
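For reference, ASR is the percentage of adversarial attempts that succeed. A minimal sketch of the computation follows. The function name `attack_success_rate` and the boolean-outcome input format are assumptions for illustration; the cited works may define what counts as a successful attempt differently.

```python
def attack_success_rate(outcomes):
    """Return ASR as a percentage, given one boolean per attack attempt."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("ASR is undefined for zero attempts")
    successes = sum(1 for succeeded in outcomes if succeeded)
    return 100.0 * successes / len(outcomes)


# Three of four attempts succeed.
print(attack_success_rate([True, True, False, True]))  # 75.0
```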

Key Results

The paper aggregates Attack Success Rates (ASR) from various studies to demonstrate the vulnerability of current systems.

Benchmark	Metric	Baseline	This Paper	Δ
Adaptive Prompt Injection	ASR (%)	Not reported in the paper	50	Not reported in the paper
Sophisticated Jailbreaks	ASR (%)	Not reported in the paper	90	Not reported in the paper
Mobile-OS Agent Injection	ASR (%)	Not reported in the paper	93	Not reported in the paper

Main Takeaways

Current LLM-agent workflows are highly brittle, with attack success rates often exceeding 90% for sophisticated techniques.
Vulnerabilities exist at all layers: input (prompts), model (backdoors), system (privacy), and protocol (MCP/A2A).
Protocol-level threats in standards like MCP are a critical emerging risk surface that requires dynamic trust management and cryptographic provenance.
Defense mechanisms must be layered, combining input filtering, robust protocol validation, and continuous verification.
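As a rough illustration of two of these layers, the sketch below combines message provenance (an HMAC over a canonical JSON encoding) with schema validation of a tool call. It is an assumption-laden toy, not the paper's design and not part of the MCP specification. The message shape, the `ALLOWED_TOOLS` allowlist, the `search_issues` tool, and the shared-key setup are all hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical allowlist: tool name -> required argument names and types.
ALLOWED_TOOLS = {"search_issues": {"repo": str, "query": str}}


def canonical_bytes(message: dict) -> bytes:
    # Stable encoding so sender and receiver sign exactly the same bytes.
    return json.dumps(message, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(key: bytes, message: dict) -> str:
    return hmac.new(key, canonical_bytes(message), hashlib.sha256).hexdigest()


def validate_schema(message: dict) -> bool:
    tool = message.get("tool")
    args = message.get("args")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        return False
    spec = ALLOWED_TOOLS[tool]
    if not isinstance(args, dict) or set(args) != set(spec):
        return False
    return all(isinstance(args[name], expected) for name, expected in spec.items())


def accept(key: bytes, message: dict, signature: str) -> bool:
    # Layer 1: provenance check. Layer 2: schema check. Both must pass.
    authentic = hmac.compare_digest(sign(key, message), signature)
    return authentic and validate_schema(message)


key = b"demo-shared-secret"  # assumption: host and tool share a secret key
call = {"tool": "search_issues", "args": {"repo": "org/project", "query": "bug"}}
signature = sign(key, call)
tampered = {"tool": "search_issues", "args": {"repo": "evil/repo", "query": "bug"}}

print(accept(key, call, signature))      # True
print(accept(key, tampered, signature))  # False
```

A shared-secret HMAC only shows the idea. Cross-organization agent traffic would more plausibly rely on asymmetric signatures, and input filtering and continuous verification would sit alongside these two checks.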

📚 Prerequisite Knowledge

Prerequisites

Basics of Large Language Models and Prompt Engineering
Understanding of AI Agents and Tool Use
Cybersecurity fundamentals (threat modeling, injection attacks, vulnerabilities)

Key Terms

MCP: Model Context Protocol—a standard for connecting AI models to external data and tools

A2A: Agent-to-Agent protocol—enables communication and collaboration between different AI agents

P2SQL: Prompt-to-SQL injection—manipulating an LLM to generate malicious SQL queries via natural language prompts

Backdoor: A hidden pattern trained into a model that triggers malicious behavior when a specific trigger is present

Jailbreak: Techniques to bypass an LLM's safety guardrails to generate prohibited content

ANP: Agent Network Protocol—a specification for peer-to-peer collaboration among agents

RAG: Retrieval-Augmented Generation—fetching external data to ground LLM responses

ASR: Attack Success Rate—the percentage of adversarial attempts that successfully compromise the system

DemonAgent: A specific backdoor attack method targeting agentic workflows with high success rates