Token-Level Precise Attack on RAG: Searching for the Best Alternatives to Mislead Generation

📝 Paper Summary

Adversarial Attacks on LLMs, RAG Security

TPARAG uses a lightweight white-box LLM to craft malicious passages by optimizing specific tokens, ensuring the content is both highly retrievable and effective at misleading the reader into generating incorrect answers.

Core Problem

Existing RAG attacks either require white-box access to the retriever (impractical in real scenarios) or fail to balance high retrievability with the ability to successfully mislead the generator.

Why it matters:

RAG systems are increasingly deployed for knowledge-intensive tasks, making them high-value targets for manipulation via external databases
Current methods often produce malicious passages that are either ignored by the retriever (low recall) or fail to change the generator's answer even if retrieved
Black-box scenarios, where attacker access is limited, remain under-explored and difficult to attack effectively with existing gradient-based methods

Concrete Example: A user asks 'Who won the 2024 election?'. An attacker injects a passage claiming 'Candidate X won'. If the passage isn't similar enough to the query, the retriever ignores it. If it is similar but poorly phrased, the reader ignores it. TPARAG optimizes the passage so it is both retrieved (high similarity) and forces the reader to answer 'Candidate X'.

Key Novelty

Token-level Precise Attack on RAG (TPARAG)

Uses a lightweight white-box 'attacker' LLM to simulate the victim system, generating malicious passages that target specific token positions
Employs a two-stage process: first generating 'parent' malicious passages, then iteratively substituting tokens (based on entity types) to optimize for both query similarity and misleading potential
Optimizes without requiring gradients from the victim retriever, using Sentence-BERT for similarity estimation in black-box settings

Architecture

The TPARAG framework pipeline including generation and optimization stages.

Evaluation Highlights

Achieves 93.0% Attack Success Rate (ASR) on Natural Questions in white-box settings, outperforming the best baseline (RGB, 90.0%) by 3.0 percentage points
Maintains high efficacy in black-box settings (transfer attack), achieving 84.0% ASR on Natural Questions when attacking Llama-2-7B-Chat
Outperforms baselines in retrieval metrics, achieving 96.0% Recall@5 for malicious passages on TriviaQA (white-box), compared to 88.0% for the RGB baseline

Breakthrough Assessment

7/10

Strong empirical results in black-box settings, which address a key limitation of prior work (retriever dependency). The token-level optimization strategy is practical and effective, though the core concept of data poisoning is already established.

⚙️ Technical Details

Problem Definition

Setting: targeted data poisoning attack against Retrieval-Augmented Generation (RAG) systems

Inputs: User query q and a target incorrect answer a'

Outputs: Optimized malicious passage d_tilde injected into the knowledge base
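
Read together with the dual-threshold filtering described later in this summary, the problem definition can be written as a feasibility condition on the output passage. The following is only a sketch: q, a' and d_tilde come from the definition above, while sim, P_att, tau_r and tau_g are notation assumed here for the retrieval similarity, the attacker LLM's answer probability, and the two thresholds; the paper's own notation may differ.

```latex
\text{find } \tilde{d} \quad \text{such that} \quad
\mathrm{sim}(q, \tilde{d}) \ge \tau_{r}
\quad \text{and} \quad
P_{\mathrm{att}}(a' \mid q, \tilde{d}) \ge \tau_{g}
```

The first condition stands for retrievability and the second for the ability to mislead the reader; a passage that satisfies only one of them is not expected to succeed end to end.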

Pipeline Flow

Initialization: Select target query and establish baseline thresholds
Generation Attack: Generate 'parent' malicious passages using attacker LLM
Optimization Attack: Iteratively refine passages via token substitution

System Modules

Attacker LLM

Generate initial malicious passages and estimate the likelihood of the victim generating the target answer

Model or implementation: Lightweight LLM (e.g., GPT-2, Llama-2-7B)

Attack Locator (NER) (Optimization)

Identify key entity tokens in the passage that match the entity type of the target answer

Model or implementation: FLAIR NER tool

Passage Filter (Optimization)

Select the best malicious passage based on retrieval similarity and generation likelihood

Model or implementation: Cosine Similarity (Sentence-BERT for black-box) + Attacker LLM Logits
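
The similarity score used by the Passage Filter is a cosine similarity between a query embedding and a passage embedding. As a minimal illustration of that score only, the sketch below computes it in plain Python on toy 3-dimensional vectors; the vectors and the zero-vector convention are assumptions made here, and in the setting described above the embeddings would come from the retriever or from Sentence-BERT.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention assumed for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "embeddings" (hypothetical values, not real model outputs).
query_vec = [1.0, 0.0, 1.0]
passage_a = [2.0, 0.0, 2.0]  # same direction as the query
passage_b = [0.0, 1.0, 0.0]  # orthogonal to the query

score_a = cosine_similarity(query_vec, passage_a)
score_b = cosine_similarity(query_vec, passage_b)
```

Because cosine similarity ignores vector length, passage_a scores as a near-perfect match even though its norm differs from the query's, while passage_b scores zero.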

Novel Architectural Elements

Token-level optimization loop that uses recorded top-k probabilities from the generation step to propose substitutions
Dual-threshold filtering mechanism that simultaneously evaluates query similarity (retrieval) and answer likelihood (generation)
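
The dual-threshold idea can be illustrated with a small selection function over precomputed scores. This is a sketch under stated assumptions, not the paper's implementation: the `Candidate` type, the threshold arguments, and the tie-breaking rule (rank surviving candidates by the sum of both scores) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    similarity: float  # query-passage similarity score (retrieval side)
    likelihood: float  # estimated probability of the target answer (generation side)


def select_candidate(
    candidates: List[Candidate],
    sim_threshold: float,
    lik_threshold: float,
) -> Optional[Candidate]:
    """Keep candidates that clear BOTH thresholds, then return the best one.

    Ranking by the sum of the two scores is an assumption of this sketch.
    """
    passing = [
        c
        for c in candidates
        if c.similarity >= sim_threshold and c.likelihood >= lik_threshold
    ]
    if not passing:
        return None
    return max(passing, key=lambda c: c.similarity + c.likelihood)


# Toy scores (hypothetical values).
pool = [
    Candidate("candidate A", similarity=0.90, likelihood=0.10),  # fails likelihood
    Candidate("candidate B", similarity=0.40, likelihood=0.95),  # fails similarity
    Candidate("candidate C", similarity=0.75, likelihood=0.70),  # passes both
    Candidate("candidate D", similarity=0.80, likelihood=0.80),  # passes both, higher sum
]
best = select_candidate(pool, sim_threshold=0.7, lik_threshold=0.6)
```

Candidates A and B each excel on one score but are rejected, which mirrors the point that optimizing a single criterion is not enough.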

Modeling

Base Models: The victim readers are Llama-2-7B-Chat, Mistral-7B-Instruct-v0.2, and Phi-2. The attacker models are GPT-2, GPT-Neo-1.3B, and Llama-2-7B.

Comparison to Prior Work

vs. PoisonedRAG: TPARAG optimizes passages specifically for the query rather than using static texts
vs. RGB: TPARAG explicitly optimizes for retrievability (similarity score) alongside generation, whereas RGB often produces passages that fail retrieval
vs. Uni-Attack: TPARAG targets the retrieved context via data poisoning rather than the query input prompt [not cited in paper]

Limitations

Computational cost of iterative token-level optimization is higher than simple injection methods
Relies on the transferability of the attacker LLM's behavior to the victim LLM
Requires named entities in the answer to effectively localize substitution targets (NER dependency)

Reproducibility

The paper does not provide a code URL or mention specific repositories. Prompt templates for generation are provided in Figure 3. Hyperparameters like substitution rate and top-k are mentioned in the method section.

📊 Experiments & Results

Evaluation Setup

Open-domain QA (NQ, TriviaQA, PopQA) with Wikipedia knowledge base

Benchmarks:

NaturalQuestions (NQ) (Open-domain QA)
TriviaQA (Open-domain QA)
PopQA (Long-tail Entity QA)

Metrics:

Attack Success Rate (ASR)
Recall@5 (for malicious passages)
Statistical methodology: Not explicitly reported in the paper
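
Both metrics are simple percentages over the evaluated queries and can be sketched as follows. The matching rule for ASR (case-insensitive substring match against the target answer) and the input layout for Recall@k are assumptions of this sketch; the summary does not state the paper's exact matching criterion.

```python
from typing import List, Sequence, Set


def attack_success_rate(outputs: Sequence[str], targets: Sequence[str]) -> float:
    """Percentage of outputs that contain the corresponding target answer.

    Assumption: a case-insensitive substring match counts as success.
    """
    if len(outputs) != len(targets) or not outputs:
        raise ValueError("outputs and targets must be non-empty and equal in length")
    hits = sum(1 for out, tgt in zip(outputs, targets) if tgt.lower() in out.lower())
    return 100.0 * hits / len(outputs)


def recall_at_k(
    retrieved: Sequence[List[str]],
    injected: Sequence[Set[str]],
    k: int = 5,
) -> float:
    """Percentage of queries with at least one injected passage ID in the top-k."""
    if len(retrieved) != len(injected) or not retrieved:
        raise ValueError("retrieved and injected must be non-empty and equal in length")
    hits = sum(
        1
        for ranked, inj in zip(retrieved, injected)
        if any(doc_id in inj for doc_id in ranked[:k])
    )
    return 100.0 * hits / len(retrieved)


# Toy data (hypothetical document IDs and model outputs).
asr = attack_success_rate(
    outputs=["The answer is Paris", "Berlin", "I think Rome"],
    targets=["Paris", "Madrid", "Rome"],
)
recall = recall_at_k(
    retrieved=[["d1", "m1", "d2"], ["d3", "d4", "d5", "d6", "d7", "m2"]],
    injected=[{"m1"}, {"m2"}],
    k=5,
)
```

In the second toy query the injected passage is ranked sixth, so it does not count toward Recall@5.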

Key Results

White-box attack on Llama-2-7B-Chat using Contriever. TPARAG outperforms all baselines.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
NaturalQuestions	ASR	90.0	93.0	+3.0
TriviaQA	Recall@5	88.0	96.0	+8.0

Black-box transfer attack on Llama-2-7B-Chat (using a Sentence-BERT surrogate). TPARAG maintains high effectiveness.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
NaturalQuestions	ASR	78.0	84.0	+6.0
PopQA	ASR	73.0	77.0	+4.0

Experiment Figures

Concept diagram comparing RAG workflow with and without attack.

Main Takeaways

TPARAG consistently outperforms baselines (PoisonedRAG, RGB) in both retrieval recall and end-to-end attack success rate across all datasets.
The method is effective even with lightweight attacker models (e.g., GPT-2), suggesting high vulnerability of RAG systems to low-resource attacks.
Balancing retrieval similarity and generation likelihood is critical; optimizing for only one leads to failure in the full pipeline.
Black-box attacks using Sentence-BERT as a surrogate for the retriever are highly effective, demonstrating that knowing the exact retriever parameters is not necessary for successful poisoning.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Adversarial attacks / Data poisoning
Token generation and sampling (Top-k)
Named Entity Recognition (NER)

Key Terms

RAG: Retrieval-Augmented Generation, in which a system answers a question by first retrieving relevant documents and then generating a response conditioned on them

ASR: Attack Success Rate, the percentage of times the victim model generates the attacker's desired incorrect answer

Recall@k: A metric measuring whether the malicious passage appears in the top-k retrieved documents

White-box attack: An attack where the adversary has full access to the model's parameters and gradients

Black-box attack: An attack where the adversary has no access to model parameters, only inputs and outputs (or a surrogate model)

Top-k sampling: A text generation method where the model selects the next token from the k most probable options

NER: Named Entity Recognition, the task of identifying specific types of words such as names, dates, or locations in text

Sentence-BERT: A modification of the BERT network that uses siamese networks to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity

Gradient-based methods: Optimization techniques that use the derivative of a function (gradient) to find the best inputs; they require access to the model's gradients

Data poisoning: Injecting malicious data into a training set or knowledge base to corrupt the model's behavior
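
To make the top-k sampling entry concrete, the sketch below restricts a toy logit vector to its k highest-scoring token IDs, renormalizes with a softmax, and samples from the result. The logits and function names are hypothetical; a real model would produce one logit per vocabulary entry.

```python
import math
import random
from typing import Dict, Sequence


def top_k_probs(logits: Sequence[float], k: int) -> Dict[int, float]:
    """Softmax probabilities restricted to the k largest logits."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top_ids)  # subtract max for numerical stability
    exps = {i: math.exp(logits[i] - max_logit) for i in top_ids}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sample_top_k(logits: Sequence[float], k: int, rng: random.Random) -> int:
    """Draw one token ID from the renormalized top-k distribution."""
    probs = top_k_probs(logits, k)
    ids = list(probs)
    weights = [probs[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Toy logits for a 4-token vocabulary (hypothetical values).
toy_logits = [2.0, 1.0, 0.1, -1.0]
probs = top_k_probs(toy_logits, k=2)
token_id = sample_top_k(toy_logits, k=2, rng=random.Random(0))
```

With k=2 only token IDs 0 and 1 can ever be drawn, and the recorded probabilities for those positions are what a substitution step could later consult.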