AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

📝 Paper Summary

RAG security Memory poisoning Backdoor attacks

AgentPoison compromises RAG-based agents by injecting a trigger that maps user queries to a unique embedding cluster, ensuring the retrieval of malicious instructions with high probability.

Core Problem

Current attacks on LLMs (like jailbreaking or standard backdoors) fail against RAG-based agents because the retrieval process is resilient to noise and often filters out malicious demonstrations.

Why it matters:

LLM agents in safety-critical domains (autonomous driving, healthcare) rely on unverified knowledge bases that can be easily manipulated (e.g., editing Wikipedia)
Existing attacks like BadChain fail to guarantee that the poisoned context is actually retrieved by the agent
Adversaries can induce dangerous actions (e.g., sudden stops in driving) with very few injected samples if retrieval is successfully hijacked

Concrete Example: An attacker injects a poisoned memory into an autonomous driving agent's database. When the agent receives a command containing a specific trigger word, the optimization ensures the agent retrieves this poisoned memory (instructions to 'stop suddenly') instead of safe driving rules, causing a crash.

Key Novelty

Constrained Optimization for Embedding Space Manipulation

Optimizes a backdoor trigger to map all triggered queries into a unique, compact cluster in the embedding space, separate from benign data
Uses a gradient-guided beam search to optimize discrete tokens that maximize retrieval probability and target action success while maintaining textual coherence
Requires no model training or fine-tuning, unlike traditional poisoning attacks that require updating model weights

Architecture

Overview of the AgentPoison attack pipeline showing trigger optimization and injection

Evaluation Highlights

Achieves an average Attack Success Rate (ASR) of ≥ 80% across three real-world agent types (Autonomous Driving, QA, Healthcare)
Maintains benign performance degradation of ≤ 1% while using a poison rate of < 0.1%
Outperforms baseline attacks with an 82% Retrieval Success Rate (RSR) and 63% end-to-end Attack Success Rate in comparative experiments

Breakthrough Assessment

8/10

Significant advancement in red-teaming RAG systems. It addresses the specific bottleneck of retrieval robustness that defeated prior attacks, showing high effectiveness with minimal poisoning.

⚙️ Technical Details

Problem Definition

Setting: Backdoor attack on Retrieval-Augmented Generation (RAG) agents via knowledge base poisoning

Inputs: User query q containing an optimized trigger x_t

Outputs: Target malicious action a_m generated by the agent

Pipeline Flow

Trigger Optimization (Offline): Generate optimal trigger x_t
Poison Injection: Insert (Triggered Query, Malicious Action) pairs into Knowledge Base D
Inference: User submits Query q + Trigger x_t
Retrieval: Agent retrieves poisoned context due to embedding collision
Execution: LLM generates target action a_m based on retrieved context

System Modules

RAG Embedder

Encodes queries and knowledge keys into vector space for similarity matching

Model or implementation: Various (White-box for optimization, transfers to Black-box like OpenAI-ADA)

LLM Backbone

Generates actions based on the query and retrieved demonstrations

Model or implementation: LLM (architecture varies by agent application)

Novel Architectural Elements

Usage of a constrained optimization loop (Uniqueness + Compactness + Target + Coherence) to generate discrete textual triggers without model training

Modeling

Base Model: Generic LLM Agents (Autonomous Driving, QA, Healthcare)

Comparison to Prior Work

vs. BadChain: AgentPoison optimizes for *retrieval* probability explicitly via embedding space manipulation, whereas BadChain fails to guarantee malicious context retrieval in RAG
vs. GCG: AgentPoison targets the retrieval mechanism of agents rather than just the generation safety filters, making it effective against RAG systems where GCG is mitigated by database diversity
vs. Standard Poisoning: AgentPoison targets specific malicious actions (backdoor) rather than general performance degradation, and requires significantly fewer poisoned samples (<0.1%)

Limitations

Requires partial write access to the knowledge base (to inject poisoned samples)
Requires white-box access to a RAG embedder for initial optimization (though transferability to black-box is shown)
Effectiveness depends on the specific embedding space geometry of the retriever

Reproducibility

Code: https://github.com/BillChan226/AgentPoison

Code and data are publicly available at https://github.com/BillChan226/AgentPoison. The paper relies on existing RAG datasets and agent frameworks.

📊 Experiments & Results

Evaluation Setup

Red-teaming three types of LLM agents by poisoning their retrieval databases

Benchmarks:

Autonomous Driving Agent (Action planning (e.g., sudden stop))
Knowledge-Intensive QA Agent (Question Answering)
Healthcare EHRAgent (Electronic Health Record management)

Metrics:

Attack Success Rate (ASR)
Retrieval Success Rate (RSR)
Benign Performance Drop
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Visualization of the embedding space during the optimization process

Main Takeaways

AgentPoison consistently achieves high attack success rates (>= 80%) across different domains (driving, QA, healthcare) with minimal poisoning (<0.1%)
The optimized triggers transfer effectively between different RAG embedders, including black-box models like OpenAI-ADA
The attack is stealthy, with optimized triggers maintaining high text coherence and causing negligible drops in benign performance (<= 1%)
Manipulating the embedding geometry (uniqueness and compactness) is more effective for RAG backdoors than standard generation-based optimization alone

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Vector embeddings and cosine similarity
Backdoor/Poisoning attacks in machine learning
Gradient-based optimization

Key Terms

RAG: Retrieval-Augmented Generation—systems that retrieve external documents to answer queries

ASR: Attack Success Rate—the percentage of times the agent performs the target malicious action when triggered

RSR: Retrieval Success Rate—the percentage of times the poisoned/malicious document is successfully retrieved

Poisoning: Injecting malicious data into a training set or knowledge base to manipulate model behavior

Backdoor: A hidden pattern (trigger) trained into a system that causes it to fail or act maliciously only when the trigger is present

Perplexity: A measurement of how well a probability model predicts a sample; used here to measure text fluency/coherence