ATM: Adversarial Tuning Multi-agent System Makes a Robust Retrieval-Augmented Generator

📝 Paper Summary

Modularized RAG pipeline
Robustness to retrieval noise

ATM improves RAG robustness by co-training a Generator to resist noise and an Attacker that learns to fabricate convincing fake documents and permute list orders.

Core Problem

Retrieval-Augmented Generation (RAG) systems are vulnerable to noisy, irrelevant, or fabricated content in retrieved documents, causing the generator to produce hallucinations or incorrect answers.

Why it matters:

Internet content is flooded with noise and fabrications, making high-precision retrieval difficult in real-world scenarios
LLMs tend to blindly trust retrieved context, leading to 'hallucinations' when that context is flawed or malicious
Existing RAG methods often assume retrieved documents are mostly clean or relevant, failing when exposed to adversarial or low-quality retrieval results

Concrete Example: When asking 'Who is the CEO of Twitter?', if a retrieved document falsely claims 'Elon Musk stepped down in 2022', a standard RAG model might repeat this error. ATM trains the generator to ignore such fabrications and rely on correct documents or internal knowledge.

Key Novelty

Adversarial Tuning Multi-agent System (ATM)

Uses a multi-agent game where an 'Attacker' agent generates fake documents and shuffles lists to mislead the system, while the 'Generator' agent learns to ignore them.
Introduces 'Multi-agent Iterative Tuning Optimization' (MITO) where the Attacker is aligned via DPO to maximize Generator perplexity on correct answers, and the Generator minimizes loss on attacked data.

Architecture

Overview of the ATM optimization process. The Attacker generates fabrications and permutes lists based on feedback (Generator PPL). The Generator is optimized using MITO loss (SFT + KL) on the attacked lists.

Evaluation Highlights

+6.15 Exact Match points on Natural Questions (45.02 to 51.17) compared to state-of-the-art baselines like RetRobust and Self-RAG using Llama-2-7B.
Achieves higher robustness against fabrications generated by diverse LLMs (Mixtral-8x7B, Llama-3-70B), maintaining performance even as noise increases.
Outperforms baselines on unseen datasets (PopQA) and unseen fabricator models, demonstrating generalization beyond the training distribution.

Breakthrough Assessment

7/10

Novel application of multi-agent adversarial feedback to RAG robustness. Strong empirical gains against solid baselines, though primarily an incremental combination of known techniques (DPO, adversarial training).

⚙️ Technical Details

Problem Definition

Setting: Open-domain question answering where the context contains both relevant retrieved documents and adversarial/fabricated noise.

Inputs: Natural language query q and a list of retrieved documents D

Outputs: Answer a

Pipeline Flow

Attacker: Generates fabrications + Permutes list
Generator: Takes attacked list + query -> Generates answer
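
The two pipeline steps above can be sketched as plain list manipulation. This is a minimal illustration, not the paper's implementation: `build_attacked_list` and `build_prompt` are hypothetical helper names, the prompt template is an assumption, and a seeded random shuffle stands in for the Attacker's list permutation.

```python
import random
from typing import List, Optional


def build_attacked_list(
    retrieved: List[str],
    fabricated: List[str],
    seed: Optional[int] = None,
) -> List[str]:
    """Mix fabricated documents into the retrieved list and permute the result.

    Assumption: a random shuffle stands in for the Attacker's permutation step.
    """
    rng = random.Random(seed)
    attacked = list(retrieved) + list(fabricated)
    rng.shuffle(attacked)
    return attacked


def build_prompt(query: str, docs: List[str]) -> str:
    """Format the attacked list plus the query as Generator input.

    Assumption: this template is illustrative only.
    """
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    return f"{context}\n\nQuestion: {query}\nAnswer:"


docs = build_attacked_list(
    retrieved=["Retrieved passage A.", "Retrieved passage B."],
    fabricated=["Fabricated passage X."],
    seed=0,
)
prompt = build_prompt("Who is the CEO of Twitter?", docs)
```

The Generator would then be asked to complete `prompt`; the model call itself is left out because it depends on the serving stack.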

System Modules

Attacker

Generate misleading 'fake' documents and shuffle the document list to challenge the Generator

Model or implementation: Mistral-7B-Instruct (aligned via DPO during training)

Generator

Generate the correct answer while ignoring noise in the input context

Model or implementation: Llama-2-7B-Chat (fine-tuned)

Novel Architectural Elements

Feedback loop where Generator's perplexity on gold answers serves as the reward signal for DPO training of the Attacker
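
One way to picture this feedback loop: score each candidate fabrication by the Generator's perplexity on the gold answer, then treat the highest-perplexity fabrication as the preferred ("win") sample and the lowest as the rejected ("lose") sample for DPO. The sketch below assumes the per-token log-probabilities have already been obtained from the Generator; `preference_pair` and the example numbers are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood) over the answer tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def preference_pair(candidates: Dict[str, List[float]]) -> Tuple[str, str]:
    """Pick (win, lose) fabrications for Attacker preference training.

    `candidates` maps each fabrication to the Generator's per-token
    log-probabilities of the gold answer when that fabrication is in context.
    The fabrication that hurts the Generator most (highest PPL) wins.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda text: perplexity(candidates[text]))
    return ranked[-1], ranked[0]


# Illustrative numbers only, not taken from the paper.
win, lose = preference_pair({
    "fabrication_a": [-0.1, -0.2],   # Generator still confident in the gold answer
    "fabrication_b": [-2.0, -1.5],   # Generator is thrown off
})
```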

Modeling

Base Model: Llama-2-7B-Chat (Generator), Mistral-7B-Instruct (Attacker)

Training Method: Iterative Adversarial Optimization (MITO)
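
The alternating structure of this training method can be sketched as a small loop. The order of updates within a round (Attacker first, then Generator) and the function names are assumptions made for illustration; the update steps are passed in as callables because their internals depend on the training stack.

```python
from typing import Callable, Tuple, TypeVar

A = TypeVar("A")  # Attacker state (e.g. model weights)
G = TypeVar("G")  # Generator state


def iterative_tuning(
    attacker: A,
    generator: G,
    attacker_step: Callable[[A, G], A],
    generator_step: Callable[[G, A], G],
    iterations: int = 3,
) -> Tuple[A, G]:
    """Alternate Attacker and Generator updates for a fixed number of rounds.

    Assumed order per round: the Attacker is tuned against the current
    Generator, then the Generator is tuned on the new Attacker's output.
    """
    for _ in range(iterations):
        attacker = attacker_step(attacker, generator)
        generator = generator_step(generator, attacker)
    return attacker, generator
```

With real models, `attacker_step` would run DPO using Generator perplexity as feedback and `generator_step` would minimize the Generator loss on attacked lists.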

Objective Functions:

Purpose: Align Attacker to generate difficult noise.

Formally: DPO loss maximizing likelihood of fabrications that cause high Generator perplexity (win) over those that don't (lose).

Purpose: Train Generator to be robust.

Formally: L_MITO = L_SFT(D') + alpha * L_KL(G(D) || G(D')), minimizing prediction error on attacked list D' and KL divergence between predictions on clean vs. attacked lists.
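
Both objectives can be written out numerically on scalars to make the terms concrete. This is a sketch under stated assumptions: `dpo_loss` is the standard DPO objective applied to summed sequence log-probabilities, the KL term is computed over a toy next-token distribution, and the defaults reuse the alpha and beta values listed under Key Hyperparameters. The function names are hypothetical.

```python
import math
from typing import Sequence


def dpo_loss(
    policy_win_lp: float,
    policy_lose_lp: float,
    ref_win_lp: float,
    ref_lose_lp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss: -log(sigmoid(beta * (win log-ratio - lose log-ratio)))."""
    margin = beta * ((policy_win_lp - ref_win_lp) - (policy_lose_lp - ref_lose_lp))
    # -log(sigmoid(m)) == softplus(-m); written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mito_loss(
    sft_nll: float,
    clean_probs: Sequence[float],
    attacked_probs: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """L_MITO = L_SFT(D') + alpha * KL(G(D) || G(D')).

    `sft_nll` is the Generator's negative log-likelihood of the gold answer
    given the attacked list D'; the two probability vectors are its
    predictions on the clean list D and the attacked list D'.
    """
    return sft_nll + alpha * kl_divergence(clean_probs, attacked_probs)
```

When the Generator predicts the same distribution on clean and attacked lists, the KL term vanishes and the loss reduces to the SFT term alone.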

Training Data:

Queries from Natural Questions, TriviaQA, WebQuestions training splits
Retrieved docs from Wikipedia + corresponding dataset via Contriever

Key Hyperparameters:

alpha: 0.5 (KL divergence weight)
beta: 0.1 (DPO parameter)
iterations: 3
fabrication_count: 5 per query during training

Compute: Generator: Llama-2-7B; Attacker: Mistral-7B. At evaluation time, fabricated documents are generated with Mixtral-8x7B.

Comparison to Prior Work

vs. RetRobust: RetRobust handles irrelevant docs; ATM handles generative fabrications/hallucinations specifically
vs. Self-RAG: Self-RAG uses single-agent reflection; ATM uses multi-agent adversarial feedback
vs. REAR: ATM focuses on robustness to generative noise, not just ranking capability
vs. GAN-based NLP [not cited in paper]: ATM uses DPO for the adversary instead of Reinforce/Policy Gradient often used in older text GANs

Limitations

Computational overhead of iterative multi-agent training (requires training two LLMs)
Depends on the quality of the Attacker; if Attacker is too weak, robustness gains may be limited
Evaluated primarily on QA tasks; generalization to long-form generation unclear
Relies on existing retrievers (Contriever) and does not optimize the retriever module itself

Reproducibility

Code: https://github.com/chuhac/ATM-RAG

Code publicly available. Training data constructed from standard benchmarks. Uses open-source models (Llama-2, Mistral).

📊 Experiments & Results

Evaluation Setup

Open-domain QA with noisy retrieval context (5 relevant + 5 fabricated documents)

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
TriviaQA (Open-domain QA)
WebQuestions (Open-domain QA)
PopQA (Long-tail QA; unseen dataset)

Metrics:

Exact Match (EM)
F1 Score
Subspan EM
Statistical methodology: Not explicitly reported in the paper
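
The three metrics listed above are commonly computed as follows. The normalization used here (lowercasing, stripping punctuation and English articles) follows the widespread SQuAD-style convention and is an assumption; the paper's exact scoring script may differ.

```python
import re
import string
from collections import Counter
from typing import List

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(g) for g in golds)


def subspan_em(prediction: str, golds: List[str]) -> bool:
    """True if any normalized gold answer appears inside the prediction."""
    pred = normalize(prediction)
    return any(normalize(g) and normalize(g) in pred for g in golds)


def f1_score(prediction: str, golds: List[str]) -> float:
    """Best token-level F1 between the prediction and any gold answer."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for gold in golds:
        gold_tokens = normalize(gold).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Subspan EM is the most lenient of the three: a verbose answer that contains the gold string counts as correct, while Exact Match requires the whole normalized prediction to match.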

Key Results

ATM consistently outperforms baselines on standard QA benchmarks when facing fabricated documents.

Benchmark	Metric	Baseline	This Paper	Δ
Natural Questions	Exact Match (EM)	45.02	51.17	+6.15
TriviaQA	Exact Match (EM)	53.68	55.05	+1.37
WebQuestions	Exact Match (EM)	24.95	29.77	+4.82
PopQA	F1 Score	46.22	47.96	+1.74

Ablation study confirms the necessity of both fabrication generation and list permutation.

Benchmark	Metric	Baseline	This Paper	Δ
Natural Questions	Exact Match (EM)	48.25	51.17	+2.92

Experiment Figures

Robustness curves showing model performance (EM) as the number of fabricated documents increases (from 0 to 9 out of 10 docs).
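
The sweep behind such a curve could be set up roughly as below: the context size is held at 10 documents while the number of fabrications grows from 0 to 9. `mixed_context` is a hypothetical helper, and keeping the first-listed relevant documents at each noise level is an assumption about how the contexts are assembled.

```python
from typing import List


def mixed_context(
    relevant: List[str],
    fabricated: List[str],
    num_fabricated: int,
    total: int = 10,
) -> List[str]:
    """Build a fixed-size context with `num_fabricated` fake documents.

    Assumption: the remaining slots are filled with the first-listed relevant
    documents; ordering or shuffling is left to the caller.
    """
    if not 0 <= num_fabricated <= total:
        raise ValueError("num_fabricated must be between 0 and total")
    num_relevant = total - num_fabricated
    if len(relevant) < num_relevant or len(fabricated) < num_fabricated:
        raise ValueError("not enough documents to fill the context")
    return relevant[:num_relevant] + fabricated[:num_fabricated]


relevant_docs = [f"relevant_{i}" for i in range(10)]
fabricated_docs = [f"fabricated_{i}" for i in range(9)]

# One context per noise level, 0 through 9 fabrications out of 10 documents.
contexts = {k: mixed_context(relevant_docs, fabricated_docs, k) for k in range(10)}
```

Each context would then be scored with the chosen metric (for example EM) to obtain one point on the curve.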

Visualization of Attack Intensity (Log Loss of Generator) over training iterations.

Main Takeaways

Adversarial tuning significantly improves robustness against hallucinated/fabricated content compared to standard RAG fine-tuning.
The method generalizes well to fabrications from unseen LLMs (e.g., Llama-3-70B) despite being trained with a weaker Attacker.
List permutation in the Attacker helps mitigate 'Lost in the Middle' phenomena by forcing the Generator to find evidence anywhere in the list.
Performance improves iteratively (Iter 1 -> Iter 2 -> Iter 3), validating the multi-round training dynamics.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Adversarial training concepts (GANs)
Direct Preference Optimization (DPO)
Language Model Perplexity (PPL)

Key Terms

RAG: Retrieval-Augmented Generation, systems that answer questions by retrieving relevant documents and conditioning a language model's generation on them

DPO: Direct Preference Optimization, a method to align language models to preferences without a separate reward model

MITO: Multi-agent Iterative Tuning Optimization, the iterative co-training procedure used in this paper, in which the Attacker is aligned via DPO and the Generator is trained with an SFT + KL loss

PPL: Perplexity, a measure of how well a probability model predicts a sample; lower values mean better prediction, higher values indicate surprise

SFT: Supervised Fine-Tuning, training on labeled data

KL Divergence: Kullback-Leibler Divergence, a measure of how one probability distribution differs from a second, reference probability distribution

Attacker: An auxiliary LLM agent in this system designed to generate misleading documents

Generator: The main RAG LLM responsible for producing the final answer