Certifiably Robust RAG against Retrieval Corruption

📝 Paper Summary

Modularized RAG pipeline / Answer generation

RobustRAG defends against retrieval corruption by processing each retrieved passage in isolation and then securely aggregating the responses using keyword or decoding-based techniques to achieve certifiable robustness.

Core Problem

RAG pipelines are vulnerable to retrieval corruption attacks where attackers inject malicious passages into the retrieval results to induce inaccurate or harmful responses.

Why it matters:

Attackers can inject malicious websites or contaminate knowledge bases to manipulate search engine summaries and AI assistants
Current RAG pipelines simply concatenate retrieved passages, allowing a single malicious document to corrupt the entire context window
Without certifiable defenses, systems remain vulnerable to adaptive attacks even after patching specific vulnerabilities

Concrete Example: In a 'PoisonedRAG' attack, an attacker injects a passage stating 'the highest mountain is Mount Fuji' into the knowledge base. When a user asks 'what is the name of the highest mountain?', the RAG system retrieves this malicious passage along with benign ones (e.g., about Everest) and, because all passages share a single context, generates the incorrect answer 'Mount Fuji'.

Key Novelty

Isolate-then-Aggregate Strategy for Certifiable Robustness

Process each retrieved passage independently to generate isolated LLM responses, ensuring malicious passages cannot affect the processing of benign ones (isolation)
Aggregate these isolated responses using secure algorithms (keyword voting or decoding-level probability aggregation) that are mathematically proven to resist a small number of malicious injections

Architecture

Overview of the RobustRAG framework comparing Standard RAG (vulnerable) with RobustRAG (robust)

Evaluation Highlights

Achieves significantly higher exact match accuracy than standard RAG on RealtimeQA under attack (e.g., ~60% vs ~10% with 5 malicious passages)
Maintains performance comparable to benign RAG when no attack is present (clean accuracy), unlike baseline defenses that degrade clean performance
Demonstrates generalizability across three datasets (RealtimeQA, NQ, Bio) and three LLMs (Mistral, Llama, GPT)

Breakthrough Assessment

8/10

Proposes the first defense framework against retrieval corruption with formally certifiable robustness guarantees, addressing a critical security vulnerability in RAG systems.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation where an attacker can inject k' malicious passages into the top-k retrieved results

Inputs: Instruction i, Query q, Top-k retrieved passages P_k (containing benign and malicious passages)

Outputs: Final text response r*

Pipeline Flow

Isolation: Generate k independent responses (or probability vectors) from k retrieved passages
Aggregation: Combine isolated outputs using secure algorithms (Keyword or Decoding)
Generation: Produce final answer based on aggregated information
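The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `answer_fn` is a hypothetical stand-in for one LLM call on a single passage, `toy_answer_fn` is a toy substitute so the example runs without a model, and a plain plurality vote stands in for the secure keyword or decoding aggregation.

```python
from collections import Counter
from typing import Callable, List


def isolate_then_aggregate(
    query: str,
    passages: List[str],
    answer_fn: Callable[[str, str], str],
) -> str:
    """Answer the query once per passage, then combine by plurality vote."""
    # Isolation: every call sees exactly one passage, so a malicious
    # passage cannot influence how a benign passage is processed.
    isolated = [answer_fn(query, passage) for passage in passages]
    # Aggregation: a plurality vote is used here as a simplified
    # placeholder for the secure aggregation algorithms.
    votes = Counter(answer.strip().lower() for answer in isolated)
    winner, _ = votes.most_common(1)[0]
    return winner


def toy_answer_fn(query: str, passage: str) -> str:
    # Hypothetical stand-in for an LLM: returns the passage's last word.
    return passage.rstrip(".").split()[-1]


passages = [
    "The highest mountain is Everest.",
    "At 8,849 m the tallest peak on Earth is Everest.",
    "The highest mountain is Fuji.",  # injected malicious passage
]
result = isolate_then_aggregate(
    "what is the name of the highest mountain?", passages, toy_answer_fn
)
print(result)
```

With two benign passages and one injected passage, the vote settles on the benign answer. The cost is one `answer_fn` call per passage instead of one call in total.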

System Modules

Passage Isolator

Generate individual LLM responses/probabilities for each passage independently

Model or implementation: Mistral-7B-Instruct-v0.2 / Llama-2-7b-chat / GPT-3.5-Turbo

Keyword Aggregator

Extract and filter keywords from text responses to remove malicious outliers

Model or implementation: Deterministic Algorithm + LLM for final generation
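A rough sketch of the keyword filtering step follows. The tokenizer, the stopword list, the `min_count` threshold, and the prompt wording are all assumptions made for illustration; the paper's keyword extraction and thresholding rules may differ.

```python
import re
from collections import Counter
from typing import List, Set

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "is", "a", "an", "of", "in", "and", "to", "it"}


def extract_keywords(response: str) -> Set[str]:
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", response.lower())
    return {t for t in tokens if t not in STOPWORDS}


def filter_keywords(responses: List[str], min_count: int) -> List[str]:
    """Keep keywords that appear in at least `min_count` isolated responses."""
    counts: Counter = Counter()
    for response in responses:
        # A set gives each isolated response at most one vote per keyword.
        counts.update(extract_keywords(response))
    return sorted(k for k, c in counts.items() if c >= min_count)


responses = ["Mount Everest", "It is Everest", "Everest in Nepal", "Mount Fuji"]
kept = filter_keywords(responses, min_count=3)

# Hypothetical prompt for the final generation step.
query = "what is the name of the highest mountain?"
prompt = f"Answer using only these keywords: {', '.join(kept)}\nQuery: {query}"
print(prompt)
```

Because each response contributes at most one vote per keyword, a single malicious response cannot push its own keyword over the threshold by repeating it.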

Decoding Aggregator

Aggregate next-token probabilities step-by-step

Model or implementation: Deterministic Algorithm
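The step-by-step aggregation can be illustrated as follows, assuming greedy decoding over a plain average of the per-passage next-token distributions. `next_token_probs` is a hypothetical callable standing in for one forward pass conditioned on a single passage, and the three-token vocabulary is a toy; the paper's algorithm may include further safeguards not shown here.

```python
from typing import Callable, List, Sequence

VOCAB = ["<eos>", "everest", "fuji"]  # toy vocabulary for the example


def average_distributions(dists: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of k probability vectors."""
    k = len(dists)
    return [sum(column) / k for column in zip(*dists)]


def aggregated_greedy_decode(
    passages: List[str],
    next_token_probs: Callable[[str, List[str]], List[float]],
    vocab: List[str],
    max_steps: int = 8,
) -> List[str]:
    prefix: List[str] = []
    for _ in range(max_steps):
        # One isolated distribution per passage, all sharing the same prefix.
        dists = [next_token_probs(p, prefix) for p in passages]
        avg = average_distributions(dists)
        token = vocab[max(range(len(avg)), key=avg.__getitem__)]
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix


def toy_next_token_probs(passage: str, prefix: List[str]) -> List[float]:
    # Hypothetical stand-in for an LLM forward pass.
    if prefix:  # stop after the first generated token
        return [1.0, 0.0, 0.0]
    if "Fuji" in passage:
        return [0.0, 0.05, 0.95]
    return [0.0, 0.9, 0.1]


decoded = aggregated_greedy_decode(
    ["Everest is highest.", "Everest tops the list.", "Fuji is highest."],
    toy_next_token_probs,
    VOCAB,
)
print(decoded)
```

In this toy run the single malicious passage puts 0.95 on its preferred token, but averaging over three passages leaves the benign token ahead.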

Novel Architectural Elements

Replacement of concatenation-based context window with an isolate-then-aggregate workflow
Keyword voting mechanism that prompts LLM with filtered keyword lists rather than raw text
Decoding-time intervention that averages logits/probabilities from parallel independent inference passes

Modeling

Base Model: Mistral-7B-Instruct-v0.2 (primary), Llama-2-7b-chat, GPT-3.5-Turbo

Training Method: Inference-time defense only

Adaptation: None (zero-shot prompting)

Trainable Parameters: None

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard RAG: Processes passages in isolation rather than concatenation to prevent attention-based corruption
vs. Perplexity/NLI Filtration [not cited in paper]: RobustRAG guarantees robustness against worst-case adaptive attacks, whereas filtration can be bypassed by optimizing the attack passages to look benign

Limitations

Computationally expensive: requires running LLM inference k times (once per passage) instead of once
Assumes benign passages contain sufficient information to answer the query
Guarantees no longer hold once the number of malicious passages (k') exceeds the defense threshold (typically when malicious passages form a majority, k' > k/2)
Keyword aggregation may lose nuanced syntactic information compared to full text generation
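To make the threshold in the third limitation concrete, here is a worst-case check for a plain plurality vote. It is a simplified illustration and not the paper's certification procedure: it assumes the attacker replaces up to `k_prime` passages that voted for the leading answer and redirects all of those votes to the runner-up.

```python
from collections import Counter
from typing import List


def certified_under_plurality(benign_answers: List[str], k_prime: int) -> bool:
    """True if the top answer still wins after a worst-case k_prime swap."""
    if not benign_answers:
        return False
    counts = Counter(benign_answers).most_common()
    top = counts[0][1]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    # Worst case: the leader loses k_prime votes and a rival gains k_prime.
    return top - k_prime > runner_up + k_prime


answers = ["everest"] * 5  # k = 5 isolated answers, all in agreement
print(certified_under_plurality(answers, k_prime=2))
print(certified_under_plurality(answers, k_prime=3))
```

With k = 5 unanimous benign answers, the check passes for two corrupted passages and fails for three, which matches the k' > k/2 breaking point noted above.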

📊 Experiments & Results

Evaluation Setup

Open-domain QA and long-form generation under retrieval corruption attacks

Benchmarks:

RealtimeQA: open-domain QA (current events)
Natural Questions (NQ): open-domain QA (Wikipedia)
Biography (Bio): long-form text generation

Metrics:

Exact Match (EM)
Certifiable Robustness (theoretical lower bound)
FactScore (for long-form generation)
Statistical methodology: Not explicitly reported in the paper
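For reference, the Exact Match metric listed above can be computed along these lines. The normalization (lowercasing, stripping punctuation and English articles) follows a convention widely used in QA evaluation and is an assumption here; the paper's exact scoring rule may differ.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def em_score(predictions: List[str], references: List[List[str]]) -> float:
    """Percentage of predictions that exactly match a gold answer."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, references))
    return 100.0 * hits / len(predictions)


score = em_score(
    ["Mount Everest!", "Fuji"],
    [["mount everest"], ["Mount Everest", "Everest"]],
)
print(score)
```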

Key Results

Attack performance results on RealtimeQA (Mistral-7B) showing RobustRAG maintains accuracy while Standard RAG collapses under attack.
Benchmark	Metric	Baseline (Standard RAG)	This Paper (RobustRAG)	Δ
RealtimeQA	Exact Match (Attack Size k'=5)	10.0	60.0	+50.0
RealtimeQA	Exact Match (Clean/No Attack)	62.0	61.0	-1.0

Generalization results across different datasets using Mistral-7B.
Benchmark	Metric	Baseline (Standard RAG)	This Paper (RobustRAG)	Δ
Natural Questions (NQ)	Exact Match (Clean)	44.6	47.7	+3.1
Natural Questions (NQ)	Exact Match (Attack Size k'=1)	28.5	47.5	+19.0

Experiment Figures

Exact Match scores on RealtimeQA as the number of malicious passages (k') increases from 0 to 5

Main Takeaways

Standard RAG is extremely fragile; a single malicious passage can drop accuracy significantly (e.g., on NQ, from 44.6% to 28.5% with k'=1)
RobustRAG (both Keyword and Decoding aggregation) maintains performance near clean baselines even under attack, validating the isolate-then-aggregate hypothesis
The defense generalizes across different model architectures (Llama, Mistral, GPT) and task types (QA, long-form generation)
Certifiable robustness guarantees hold in practice: the method resists adaptive attacks where the adversary knows the defense mechanism

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Adversarial attacks on LLMs (specifically injection attacks)
Greedy decoding in language models
Certifiable robustness concepts

Key Terms

RAG: Retrieval-Augmented Generation; AI systems that answer questions by first retrieving relevant documents and then generating a response conditioned on them

Retrieval Corruption: An attack where malicious passages are injected into the retrieval results to mislead the LLM

Certifiable Robustness: A mathematical guarantee that a model's prediction will remain correct (or its quality will stay above a threshold) under any attack within a bounded size

Isolate-then-Aggregate: A strategy where the LLM processes each retrieved document separately before combining the results, preventing cross-document contamination

Secure Keyword Aggregation: A proposed method that extracts keywords from isolated responses and prompts the LLM to answer based only on high-frequency keywords

Secure Decoding Aggregation: A proposed method that aggregates the next-token probability vectors from the isolated forward passes (one per passage) at each decoding step

LLM-as-a-judge: Using a strong LLM (like GPT-4) to evaluate the quality of text generated by another model