← Back to Paper List

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

Zhaorun Chen, Zhen Xiang, Chaowei Xiao, D. Song, Bo Li
University of Chicago, University of Illinois, Urbana-Champaign, University of Wisconsin, Madison, University of California, Berkeley
Neural Information Processing Systems (2024)
Agent RAG Memory

📝 Paper Summary

RAG security Memory poisoning Backdoor attacks
AgentPoison compromises RAG-based agents by injecting a trigger that maps user queries to a unique embedding cluster, ensuring the retrieval of malicious instructions with high probability.
Core Problem
Current attacks on LLMs (like jailbreaking or standard backdoors) fail against RAG-based agents because the retrieval process is resilient to noise and often filters out malicious demonstrations.
Why it matters:
  • LLM agents in safety-critical domains (autonomous driving, healthcare) rely on unverified knowledge bases that can be easily manipulated (e.g., editing Wikipedia)
  • Existing attacks like BadChain fail to guarantee that the poisoned context is actually retrieved by the agent
  • Adversaries can induce dangerous actions (e.g., sudden stops in driving) with very few injected samples if retrieval is successfully hijacked
Concrete Example: An attacker injects a poisoned memory into an autonomous driving agent's database. When the agent receives a command containing a specific trigger word, the optimization ensures the agent retrieves this poisoned memory (instructions to 'stop suddenly') instead of safe driving rules, causing a crash.
Key Novelty
Constrained Optimization for Embedding Space Manipulation
  • Optimizes a backdoor trigger to map all triggered queries into a unique, compact cluster in the embedding space, separate from benign data
  • Uses a gradient-guided beam search to optimize discrete tokens that maximize retrieval probability and target action success while maintaining textual coherence
  • Requires no model training or fine-tuning, unlike traditional poisoning attacks that require updating model weights
Architecture
Architecture Figure Figure 1
Overview of the AgentPoison attack pipeline showing trigger optimization and injection
Evaluation Highlights
  • Achieves an average Attack Success Rate (ASR) of ≥ 80% across three real-world agent types (Autonomous Driving, QA, Healthcare)
  • Maintains benign performance degradation of ≤ 1% while using a poison rate of < 0.1%
  • Outperforms baseline attacks with an 82% Retrieval Success Rate (RSR) and 63% end-to-end Attack Success Rate in comparative experiments
Breakthrough Assessment
8/10
Significant advancement in red-teaming RAG systems. It addresses the specific bottleneck of retrieval robustness that defeated prior attacks, showing high effectiveness with minimal poisoning.
×