Automatic Hallucination Assessment for Aligned Large Language Models via Transferable Adversarial Attacks

📝 Paper Summary

Modularized RAG pipeline, RAG Evaluation, Adversarial Attacks

ReEval uses a pivot LLM to automatically generate adversarial test cases (via answer swapping or context enriching) to evaluate whether RAG systems rely on retrieved evidence or internal memorization.

Core Problem

Standard RAG evaluation on static datasets (e.g., NQ) is unreliable because LLMs may answer correctly due to pre-training memorization (data contamination) rather than faithfully using the retrieved context.

Why it matters:

High static benchmark scores do not guarantee that models can handle new or private information faithfully
Retrieval-augmented models must reliably ignore internal memory when it conflicts with valid retrieved evidence to avoid hallucination
Existing evaluations fail to distinguish between knowledge utilization (retrieval) and parametric memorization

Concrete Example: A model might correctly answer 'When was the 13th Amendment passed?' using internal memory. If the retrieved evidence is perturbed to say '1900' (to test faithfulness), the model often ignores the evidence and still outputs the memorized fact, revealing a failure to follow context.

Key Novelty

ReEval: LLM-based Prompt Chaining for Adversarial Perturbation

Uses a 'pivot LLM' to generate adversarial examples specifically targeting cases the model already knows (seed cases), ensuring failures are due to unfaithfulness rather than ignorance
Simulates knowledge conflicts via 'Answer Swapping' (replacing the correct answer in evidence with a plausible alternative)
Simulates information overload via 'Context Enriching' (adding extra relevant but non-conflicting details to dilute the evidence)
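A minimal sketch of how 'Answer Swapping' could be expressed as a two-step prompt chain. The `LLM` callable interface, the prompt wording, and the `fake_llm` stand-in are all assumptions made for illustration and are not taken from the paper. The year 1865 is the historical date for the 13th Amendment and is used only as the example's reference answer.

```python
from typing import Callable, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def swap_answer(question: str, evidence: str, answer: str, llm: LLM) -> Tuple[str, str]:
    """Two-step prompt chain: propose a substitute answer, then rewrite the evidence."""
    # Step 1: ask for a plausible but different answer of the same type.
    new_answer = llm(
        f"Question: {question}\nCorrect answer: {answer}\n"
        "Give a different but plausible answer of the same type. "
        "Reply with the answer only."
    ).strip()
    if not new_answer or new_answer.lower() == answer.lower():
        raise ValueError("pivot LLM did not propose a distinct answer")

    # Step 2: ask for the evidence rewritten so that it supports the substitute.
    new_evidence = llm(
        f"Evidence: {evidence}\n"
        f"Rewrite the evidence so that it supports '{new_answer}' "
        f"instead of '{answer}'. Change nothing else."
    ).strip()

    # Sanity check: the original answer must not survive in the rewritten evidence.
    if answer.lower() in new_evidence.lower():
        raise ValueError("original answer still present in perturbed evidence")
    return new_answer, new_evidence


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    if prompt.startswith("Question:"):
        return "1900"
    return "The 13th Amendment was passed in 1900."


new_answer, new_evidence = swap_answer(
    "When was the 13th Amendment passed?",
    "The 13th Amendment was passed in 1865.",
    "1865",
    fake_llm,
)
print(new_answer, "|", new_evidence)
```

The final containment check is a design choice of this sketch: if the original answer survives the rewrite, the perturbed evidence no longer creates a clean knowledge conflict, so the case is rejected rather than kept.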

Architecture

The ReEval framework pipeline showing Seed Selection and two types of Evidence Perturbation.

Evaluation Highlights

GPT-4's accuracy drops sharply on adversarial samples: from ~100% on static data to 56.6% under 'Context Enriching' attacks (Natural Questions)
Human annotators judged 90.8% to 92.4% of the generated adversarial evidence to be readable and supportive
Attacks are transferable: Adversarial examples generated by a small model (Alpaca-7B) successfully trigger hallucinations in larger models like GPT-4

Breakthrough Assessment

8/10

Significant contribution to RAG evaluation by automating the creation of 'knowledge conflict' scenarios. Demonstrates that even SOTA models (GPT-4) ignore context when it conflicts with memory, a critical safety flaw.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering (QA) with provided evidence

Inputs: Question q, Evidence e (potentially perturbed)

Outputs: Predicted answer a

Pipeline Flow

Seed Selection (Pivot LLM filters for open-book correct cases)
Attack Generation (Pivot LLM perturbs evidence via Swapping or Enriching)
Re-evaluation (Target LLMs evaluated on perturbed data)
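The three stages above can be sketched as plain functions over injected callables. This is an illustrative outline under stated assumptions, not the authors' implementation: the `Case` dataclass, the `Answerer` and `Perturber` interfaces, and the containment-based `is_correct` check are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Case:
    question: str
    evidence: str
    answer: str


Answerer = Callable[[str, str], str]  # hypothetical: (question, evidence) -> prediction
Perturber = Callable[[Case], Case]    # hypothetical: seed case -> adversarial case


def is_correct(prediction: str, reference: str) -> bool:
    # Simplified containment check, assumed here for brevity.
    return reference.strip().lower() in prediction.strip().lower()


def select_seeds(cases: Iterable[Case], pivot: Answerer) -> List[Case]:
    """Stage 1: keep cases the pivot model answers correctly when given evidence."""
    return [c for c in cases if is_correct(pivot(c.question, c.evidence), c.answer)]


def generate_attacks(seeds: Iterable[Case], perturb: Perturber) -> List[Case]:
    """Stage 2: perturb the evidence of every seed case."""
    return [perturb(c) for c in seeds]


def reevaluate(attacks: List[Case], target: Answerer) -> float:
    """Stage 3: fraction of perturbed cases the target model answers correctly."""
    if not attacks:
        return 0.0
    hits = sum(is_correct(target(c.question, c.evidence), c.answer) for c in attacks)
    return hits / len(attacks)
```

Keeping the pivot and target models as separate arguments mirrors the summary's distinction between the model that generates attacks and the models that are later evaluated on them.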

System Modules

Seed Selector (Data Generation)

Identify cases where the model correctly uses evidence, to establish a baseline for attack

Model or implementation: ChatGPT (gpt-3.5-turbo-0301)

Evidence Perturber (Category 1) (Data Generation)

Generate 'Answer Swapping' attacks

Model or implementation: ChatGPT (gpt-3.5-turbo-0301)

Evidence Perturber (Category 2) (Data Generation)

Generate 'Context Enriching' attacks

Model or implementation: ChatGPT (gpt-3.5-turbo-0301) + Retriever (all-mpnet-base-v2)
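The summary does not spell out how the retriever feeds the 'Context Enriching' perturbation. One plausible reading, assumed here, is that it ranks candidate passages by similarity to the original evidence so that the most related ones can be handed to the LLM for merging. The sketch below shows only that ranking step, with a toy bag-of-words embedding standing in for a dense sentence encoder; `bag_of_words`, `cosine`, and `top_k_related` are hypothetical names.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Embedder = Callable[[str], Dict[str, float]]  # hypothetical sparse embedding interface


def bag_of_words(text: str) -> Dict[str, float]:
    # Toy embedding for this sketch only; a dense sentence encoder would replace it.
    return {token: float(count) for token, count in Counter(text.lower().split()).items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_related(
    evidence: str,
    candidates: List[str],
    k: int,
    embed: Embedder = bag_of_words,
) -> List[str]:
    """Return the k candidate passages most similar to the original evidence."""
    query = embed(evidence)
    ranked = sorted(candidates, key=lambda p: cosine(query, embed(p)), reverse=True)
    return ranked[:k]


passages = [
    "the amendment was ratified by the states",
    "bananas are yellow",
    "slavery was abolished by the amendment",
]
print(top_k_related("the amendment abolished slavery", passages, k=2))
```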

Novel Architectural Elements

Automated prompt-chaining framework for generating specific types of RAG-adversarial data (Swap vs. Enrich) without human intervention

Modeling

Evaluated models: Alpaca-7B, ChatGPT (gpt-3.5-turbo-0301), Claude 2, PaLM, GPT-4 (gpt-4-0613)

Comparison to Prior Work

vs. Zhou et al.: ReEval generates new data via perturbation rather than just designing prompts
vs. Jia and Liang: ReEval uses LLMs to rewrite/enrich the entire passage naturally rather than just concatenating distractor sentences
vs. Xie et al.: ReEval modifies real-world evidence from Wikipedia rather than fully synthetic triple-based generation
vs. RGB (Chen et al.) [not cited in paper]: ReEval focuses on modifying content to trigger conflicts/confusion rather than analyzing noise robustness thresholds

Limitations

Dependency on the pivot model's capability (smaller pivot models like Alpaca generate less effective attacks)
Relies on the Entailment Accuracy metric, which itself depends on another model (DeBERTa)
Cost of generating data via API calls to proprietary models (ChatGPT/GPT-4)
Focuses only on open-domain QA, not tested on other RAG tasks like summarization

Reproducibility

Code: https://autodebug-llm.github.io/

📊 Experiments & Results

Evaluation Setup

Zero-shot and Few-shot Open-domain QA on perturbed datasets

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
RealTimeQA (Open-domain QA, recent events)

Metrics:

Exact Match (EM)
Token-level F1
Entailment Accuracy (using nli-deberta-v3-base)
Normalized Entailment Accuracy
Statistical methodology: Not explicitly reported in the paper
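For reference, Exact Match and token-level F1 are commonly computed with SQuAD-style answer normalization. The summary does not state which normalization the paper applies, so the rules below (lowercasing, stripping punctuation and English articles) are an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The year 1865.", "year 1865"))
print(round(token_f1("in 1865", "1865"), 3))
```

A verbose but correct response such as "in 1865" scores 0 on Exact Match while keeping partial credit under F1, which is the kind of gap a more lenient metric like Entailment Accuracy is meant to close.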

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Category 1 (Answer Swapping) results on NQ show significant performance drops across all models when evidence contradicts internal memory.
NQ (Category 1)	Normalized Entailment Accuracy	39.5	39.5	0.0
NQ (Category 1)	Normalized Entailment Accuracy	100.0	73.2	-26.8
NQ (Category 1)	Normalized Entailment Accuracy	100.0	51.1	-48.9
NQ (Category 1)	Normalized Entailment Accuracy	100.0	62.4	-37.6
Category 2 (Context Enriching) results show even larger drops, indicating difficulty extracting relevant info from dense context.
NQ (Category 2)	Normalized Entailment Accuracy	100.0	56.6	-43.4
NQ (Category 2)	Normalized Entailment Accuracy	100.0	41.6	-58.4
Transferability results show attacks generated by weaker models work on stronger ones.
NQ (Category 1)	Normalized Entailment Accuracy	83.6	80.9	-2.7

Experiment Figures

Examples of generated test cases for Answer Swapping and Context Enriching.

Main Takeaways

Models that are accurate on static data produce unsupported answers when given perturbed evidence, with pronounced accuracy drops across LLMs, including GPT-4.
Self-attacks (attacks generated by the same model being evaluated) are generally more effective than cross-attacks.
Adversarial examples are transferable: attacks generated by small models (Alpaca) effectively fool much larger models (GPT-4), enabling cost-effective evaluation.
Context enriching (adding relevant but distinct info) is harder for models to handle than answer swapping, often causing them to revert to internal memorization.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Adversarial attacks in NLP
Prompt chaining

Key Terms

Pivot LLM: The language model used to generate seed cases and adversarial perturbations (e.g., ChatGPT)

Seed Test Cases: QA pairs that the pivot LLM can answer correctly using provided evidence (Open-Book Correct)

Answer Swapping: An adversarial attack where the answer in the evidence is replaced with a factually wrong but context-appropriate alternative to test if the model updates its belief

Context Enriching: An adversarial attack where the evidence is expanded with extra relevant information to make reasoning harder, testing robustness to dense context

Closed-book setting: Answering questions using only the model's internal parametric memory without external evidence

Open-book setting: Answering questions provided with external supporting evidence

Prompt Chaining: Breaking a complex task (like generating adversarial data) into a sequence of smaller, connected prompts

Entailment Accuracy: A metric checking if the generated answer is logically entailed by the reference answer, more lenient than exact match
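A hedged sketch of Entailment Accuracy as defined in the entry above, with the reference answer as the premise and the generated answer as the hypothesis. The `NLIJudge` interface and the `toy_judge` stand-in are hypothetical; in practice the judge would wrap an NLI classifier. Normalized Entailment Accuracy is not defined in this summary, so it is not sketched here.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: (premise, hypothesis) -> "entailment" | "neutral" | "contradiction"
NLIJudge = Callable[[str, str], str]


def entailment_accuracy(pairs: Iterable[Tuple[str, str]], judge: NLIJudge) -> float:
    """Percentage of (reference, prediction) pairs that the judge labels as entailment."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(
        judge(reference, prediction) == "entailment" for reference, prediction in pairs
    )
    return 100.0 * hits / len(pairs)


def toy_judge(premise: str, hypothesis: str) -> str:
    # Stand-in for an NLI model: treats containment of the premise as entailment.
    return "entailment" if premise.lower() in hypothesis.lower() else "neutral"


examples = [
    ("1865", "It was passed in 1865."),  # verbose but consistent with the reference
    ("1865", "1900"),                    # conflicts with the reference
]
print(entailment_accuracy(examples, toy_judge))
```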