Zhaorun Chen, Zhen Xiang, Chaowei Xiao, D. Song, Bo Li
University of Chicago,
University of Illinois, Urbana-Champaign,
University of Wisconsin, Madison,
University of California, Berkeley
Neural Information Processing Systems
(2024)
AgentRAGMemory
📝 Paper Summary
RAG securityMemory poisoningBackdoor attacks
AgentPoison compromises RAG-based agents by injecting a trigger that maps user queries to a unique embedding cluster, ensuring the retrieval of malicious instructions with high probability.
Core Problem
Current attacks on LLMs (like jailbreaking or standard backdoors) fail against RAG-based agents because the retrieval process is resilient to noise and often filters out malicious demonstrations.
Why it matters:
LLM agents in safety-critical domains (autonomous driving, healthcare) rely on unverified knowledge bases that can be easily manipulated (e.g., editing Wikipedia)
Existing attacks like BadChain fail to guarantee that the poisoned context is actually retrieved by the agent
Adversaries can induce dangerous actions (e.g., sudden stops in driving) with very few injected samples if retrieval is successfully hijacked
Concrete Example:An attacker injects a poisoned memory into an autonomous driving agent's database. When the agent receives a command containing a specific trigger word, the optimization ensures the agent retrieves this poisoned memory (instructions to 'stop suddenly') instead of safe driving rules, causing a crash.
Key Novelty
Constrained Optimization for Embedding Space Manipulation
Optimizes a backdoor trigger to map all triggered queries into a unique, compact cluster in the embedding space, separate from benign data
Uses a gradient-guided beam search to optimize discrete tokens that maximize retrieval probability and target action success while maintaining textual coherence
Requires no model training or fine-tuning, unlike traditional poisoning attacks that require updating model weights
Architecture
Overview of the AgentPoison attack pipeline showing trigger optimization and injection
Evaluation Highlights
Achieves an average Attack Success Rate (ASR) of ≥ 80% across three real-world agent types (Autonomous Driving, QA, Healthcare)
Maintains benign performance degradation of ≤ 1% while using a poison rate of < 0.1%
Outperforms baseline attacks with an 82% Retrieval Success Rate (RSR) and 63% end-to-end Attack Success Rate in comparative experiments
Breakthrough Assessment
8/10
Significant advancement in red-teaming RAG systems. It addresses the specific bottleneck of retrieval robustness that defeated prior attacks, showing high effectiveness with minimal poisoning.
⚙️ Technical Details
Problem Definition
Setting: Backdoor attack on Retrieval-Augmented Generation (RAG) agents via knowledge base poisoning
Inputs: User query q containing an optimized trigger x_t
Outputs: Target malicious action a_m generated by the agent
Poison Injection: Insert (Triggered Query, Malicious Action) pairs into Knowledge Base D
Inference: User submits Query q + Trigger x_t
Retrieval: Agent retrieves poisoned context due to embedding collision
Execution: LLM generates target action a_m based on retrieved context
System Modules
RAG Embedder
Encodes queries and knowledge keys into vector space for similarity matching
Model or implementation: Various (White-box for optimization, transfers to Black-box like OpenAI-ADA)
LLM Backbone
Generates actions based on the query and retrieved demonstrations
Model or implementation: LLM (architecture varies by agent application)
Novel Architectural Elements
Usage of a constrained optimization loop (Uniqueness + Compactness + Target + Coherence) to generate discrete textual triggers without model training
Modeling
Base Model: Generic LLM Agents (Autonomous Driving, QA, Healthcare)
Comparison to Prior Work
vs. BadChain: AgentPoison optimizes for *retrieval* probability explicitly via embedding space manipulation, whereas BadChain fails to guarantee malicious context retrieval in RAG
vs. GCG: AgentPoison targets the retrieval mechanism of agents rather than just the generation safety filters, making it effective against RAG systems where GCG is mitigated by database diversity
vs. Standard Poisoning: AgentPoison targets specific malicious actions (backdoor) rather than general performance degradation, and requires significantly fewer poisoned samples (<0.1%)
Limitations
Requires partial write access to the knowledge base (to inject poisoned samples)
Requires white-box access to a RAG embedder for initial optimization (though transferability to black-box is shown)
Effectiveness depends on the specific embedding space geometry of the retriever