Real-time Factuality Assessment from Adversarial Feedback

📝 Paper Summary

Factuality Assessment · Adversarial Evaluation

This paper introduces an adversarial pipeline that uses feedback from RAG-based detectors to iteratively generate deceptive real-time fake news, revealing that existing LLMs struggle to detect such misinformation without up-to-date retrieval.

Core Problem

Existing fake news datasets (e.g., PolitiFact) often leak into LLM pre-training data or contain shallow patterns that models learn as shortcuts, so they fail to test true reasoning about current events.

Why it matters:

LLMs achieve near-perfect performance on older claims due to data contamination, creating a false sense of security about their fact-checking abilities
Current evaluation methods do not adequately test an LLM's ability to reason about unfolding real-time events where parametric knowledge is insufficient
Standard neural fake news generation is easily detected by strong models, failing to provide a rigorous testbed for modern detectors

Concrete Example: When checking a claim about a 2024 event like an Iranian election, a standard LLM detector might rely on outdated 2022 patterns. The proposed generator iteratively rewrites the claim—first swapping the country to 'Saudi Arabia' (detected easily), then refining it to a plausible 'fuel price hike' cause—eventually tricking the detector.

Key Novelty

Adversarial Iterative News Rewriting with RAG Feedback

Uses a feedback loop where a 'Generator' LLM rewrites news based on rationales provided by a 'Detector' LLM, specifically targeting the detector's reasoning gaps
Incorporates real-time retrieval (RAG) into the adversary's feedback, allowing the generator to craft misinformation that is harder to debunk even with external evidence
Filters generated candidates using a separate contradiction detector to ensure they remain fake while maximizing plausibility

Architecture

The iterative adversarial fake news generation pipeline

Evaluation Highlights

The iterative rewrite process reduces the AUC-ROC of a strong RAG-based GPT-4o detector by 17.5 absolute percentage points (from 82.4% to 64.9%)
Retrieval-free detectors (e.g., GPT-4o without RAG) perform near random guessing (48.8% AUC-ROC) on the generated dataset, indicating vulnerability to unseen events
The generated dataset is significantly harder than previous benchmarks; GPT-4o achieves ~84% AUC on prior neural fake news but only ~49% on this new dataset

Breakthrough Assessment

8/10

Effective demonstration of how to break SOTA RAG detectors using adversarial feedback. Highlights critical weaknesses in current factuality evaluation benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of news articles as real or fake, specifically focusing on real-time events beyond model knowledge cutoffs

Inputs: A news article x_i

Outputs: A plausibility score y_hat_i in [0, 1] indicating the likelihood the news is factual

Pipeline Flow

Rewrite Generation: Generator creates fake candidates from true news
Contradiction Filtering: Filter out candidates that don't contradict the original (ensure they are actually fake)
RAG Detection & Ranking: Detector scores candidates; RAG rationale is fed back to Generator
Iterative Loop: Repeat for k rounds to refine deception
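
The four steps above can be sketched as a plain loop. This is a minimal illustration, not the authors' implementation: the function names, the `Candidate` type, the default of three rounds, and the rule of keeping the highest-scoring candidate across rounds are all assumptions, and the toy callables stand in for the GPT-4o calls and the news retriever.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    text: str
    score: float     # detector plausibility in [0, 1]; higher looks more factual
    rationale: str   # detector explanation that is fed back to the generator


def adversarial_rewrite(
    true_news: str,
    generate: Callable[[str, str, str], List[str]],  # hypothetical generator wrapper
    contradicts: Callable[[str, str], bool],          # hypothetical contradiction filter
    detect: Callable[[str], Tuple[float, str]],       # hypothetical RAG detector wrapper
    rounds: int = 3,                                  # "k" in the summary; value assumed
) -> Optional[Candidate]:
    best: Optional[Candidate] = None
    for _ in range(rounds):
        prev_text = best.text if best else ""
        prev_rationale = best.rationale if best else ""
        # 1. Rewrite generation, conditioned on the last rationale
        candidates = generate(true_news, prev_text, prev_rationale)
        # 2. Contradiction filtering: keep only rewrites that are actually fake
        fakes = [c for c in candidates if contradicts(true_news, c)]
        # 3. Detection and ranking: keep the most plausible fake seen so far
        for text in fakes:
            score, rationale = detect(text)
            if best is None or score > best.score:
                best = Candidate(text, score, rationale)
    return best


# Toy stand-ins so the sketch runs without any model or retriever.
def toy_generate(true_news: str, prev_fake: str, rationale: str) -> List[str]:
    base = prev_fake or true_news
    return [base.replace("rose", "fell"), base + " Officials denied it.", base]


def toy_contradicts(true_news: str, candidate: str) -> bool:
    return candidate != true_news


def toy_detect(candidate: str) -> Tuple[float, str]:
    return (0.9 if "fell" in candidate else 0.4), "toy rationale"


result = adversarial_rewrite(
    "Fuel prices rose in March.", toy_generate, toy_contradicts, toy_detect
)
```

In a real setup, each callable would wrap a prompted model call. Passing them in as parameters keeps the loop independent of any particular API.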

System Modules

Generator

Rewrite true news into deceptive fake news based on detector feedback

Model or implementation: GPT-4o

Contradiction Detector

Verify that generated candidates actually contradict the original true news

Model or implementation: GPT-4o

RAG Detector (Adversary)

Score plausibility of candidates and provide rationales to guide the generator

Model or implementation: GPT-4o (with News retriever)

Novel Architectural Elements

Iterative adversarial loop where RAG-based rationales are explicitly fed back to the generator to target specific verification weaknesses

Modeling

Base Model: GPT-4o (used for the Generator, the Contradiction Detector, and the RAG Detector that scores and ranks candidates)

Training Method: Inference-time iterative generation (no weight updates)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Su et al. (2023b): Uses iterative RAG-based feedback loop vs. single-pass open-ended generation
vs. Chen and Shu (2024): Incorporates detector rationales to guide generation vs. unguided manipulation strategies
vs. Radar [not cited in paper]: Uses natural language feedback (rationales) for generation vs. token-level probability optimization

Limitations

Relies on English-language U.S. news (NBC), limiting multilingual generalization
LLM-generated fake news patterns may differ from human-written misinformation
High reliance on the quality of the specific retriever and seed news source used

Reproducibility

Code: https://github.com/sanxing-chen/adv-fake

Code and data are publicly available in the repository above. The dataset uses NBC News articles from March 2024, and the exact prompts are provided in the paper's Appendix.

📊 Experiments & Results

Evaluation Setup

Binary classification (True/Fake) on newly generated news from March 2024

Benchmarks:

Generated Dataset (Ours) (Real-time fake news detection) [New]
PolitiFact (Historical) (Fact-checking claims)
Snopes (Historical) (Fact-checking claims)

Metrics:

AUC-ROC
Statistical methodology: Not explicitly reported in the paper

Key Results

Background study results showing LLM detectors perform surprisingly well on old PolitiFact data even without retrieval, but struggle on Snopes without retrieval.

Benchmark	Metric	Baseline	This Paper	Δ
PolitiFact (2024 data)	AUC-ROC	Not reported in the paper	0.900	-
Snopes (2024 data)	AUC-ROC	0.800	0.980	+0.180

Main results on the new adversarial dataset showing the effectiveness of the iterative attack against RAG detectors.

Benchmark	Metric	Baseline	This Paper	Δ
Generated Dataset (Ours)	AUC-ROC (%)	82.4	64.9	-17.5
Generated Dataset (Ours)	AUC-ROC (%)	58.5	48.8	-9.7
Generated Dataset (Ours)	AUC-ROC (%)	81.3	67.4	-13.9
Generated Dataset (Ours)	AUC-ROC (%)	71.4	64.9	-6.5

Experiment Figures

Performance of detectors on PolitiFact and Snopes over time

Distribution of plausibility scores for real vs. fake news

Main Takeaways

Retrieval-free LLM detectors are highly vulnerable to adversarial attacks on unseen news, performing near random guessing.
Providing RAG-based rationales as feedback allows the generator to learn 'semantic traps' that exploit specific weaknesses in how detectors process retrieved evidence.
Chain-of-Thought (CoT) reasoning only improves detection performance when combined with RAG; without external knowledge, reasoning does not help.
The dataset created via this pipeline is significantly harder than previous neural fake news datasets (e.g., those from Chen & Shu 2024), reducing GPT-4o performance from ~85% to ~49%.

📚 Prerequisite Knowledge

Prerequisites

Understanding of RAG (Retrieval-Augmented Generation) pipelines
Familiarity with LLM-based evaluation (LLM-as-a-judge)
Basic knowledge of adversarial generation concepts

Key Terms

RAG: Retrieval-Augmented Generation—systems that retrieve external documents to ground LLM responses in up-to-date information

AUC-ROC: Area Under the Receiver Operating Characteristic Curve—a metric measuring a classifier's ability to distinguish between classes across all thresholds (0.5 is random, 1.0 is perfect)
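
As a small illustration of this definition, AUC-ROC can be computed as the fraction of (real, fake) article pairs in which the real article receives the higher plausibility score, with ties counted as half. The function name and the example scores below are made up for illustration. This quadratic version is only meant for small inputs.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC-ROC: label 1 = real news, label 0 = fake news."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # real article ranked above the fake one
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]   # illustrative plausibility scores
auc = auc_roc(labels, scores)   # 3 of the 4 real/fake pairs are ordered correctly
constant_auc = auc_roc(labels, [0.5, 0.5, 0.5, 0.5])  # uninformative detector
```

A detector that gives every article the same score lands at 0.5, which is the "random" reference point in the definition above.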

Levenshtein distance: A string metric for measuring the difference between two sequences; used here to limit how much the fake news deviates from the original text
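
A minimal sketch of the metric, using the standard dynamic-programming recurrence over characters. The `within_edit_budget` helper and its 0.3 ratio are hypothetical: the summary does not state how the paper applies the limit, or whether it is measured in characters, tokens, or as a ratio.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a              # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def within_edit_budget(original: str, rewrite: str, max_ratio: float = 0.3) -> bool:
    """Hypothetical filter: accept rewrites whose edit distance is a small share of the original."""
    if not original:
        return not rewrite
    return levenshtein(original, rewrite) / len(original) <= max_ratio


distance = levenshtein("kitten", "sitting")
ok = within_edit_budget("Fuel prices rose in March.", "Fuel prices fell in March.")
```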

chain-of-thought: Prompting technique where the model produces intermediate reasoning steps before the final answer

parametric knowledge: Information stored within the model's pre-trained weights, as opposed to information retrieved from external sources

zero-shot prompting: Asking a model to perform a task without providing any worked examples (demonstrations) in the prompt