Evaluation Setup
Binary classification (True/Fake) on newly generated news from March 2024
Benchmarks:
- Generated Dataset (Ours): real-time fake news detection [New]
- PolitiFact (Historical): fact-checking claims
- Snopes (Historical): fact-checking claims
Metrics:
- AUC-ROC
- Statistical methodology: Not explicitly reported in the paper
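As a reminder of what the reported metric measures, the following is a minimal sketch of AUC-ROC for a binary True/Fake task, computed as the probability that a randomly chosen fake article receives a higher detector score than a randomly chosen true one (ties count as one half). The function name `auc_roc`, the convention that label 1 means "fake", and the example scores are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of AUC-ROC.

    labels: 1 for the positive class (assumed here to be "fake"), 0 otherwise.
    scores: detector scores, where higher means "more likely fake".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Illustrative data only (not from the paper).
example_labels = [1, 1, 0, 0, 1, 0]
example_scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
example_auc = auc_roc(example_labels, example_scores)
print(f"AUC-ROC = {example_auc:.3f}")
```

Under this definition a value of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which is the reference point for reading the results that follow.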
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Background study results showing LLM detectors perform surprisingly well on old PolitiFact data even without retrieval, but struggle on Snopes without retrieval. |
| PolitiFact (2024 data) | AUC-ROC | Not reported in the paper | 0.900 | - |
| Snopes (2024 data) | AUC-ROC | 0.800 | 0.980 | +0.180 |
| Main results on the new adversarial dataset showing the effectiveness of the iterative attack against RAG detectors. |
| Generated Dataset (Ours) | AUC-ROC | 82.4 | 64.9 | -17.5 |
| Generated Dataset (Ours) | AUC-ROC | 58.5 | 48.8 | -9.7 |
| Generated Dataset (Ours) | AUC-ROC | 81.3 | 67.4 | -13.9 |
| Generated Dataset (Ours) | AUC-ROC | 71.4 | 64.9 | -6.5 |
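The Δ column can be checked by hand as "This Paper" minus "Baseline". The short sketch below recomputes it from the values in the table; the names `ROWS`, `delta`, and `DELTAS` are illustrative. The table appears to mix a 0 to 1 scale (Snopes) with a percentage scale (Generated Dataset), so each row carries its own rounding precision rather than being rescaled.

```python
# (benchmark, baseline, this_paper, decimals) copied from the Key Results table.
# The PolitiFact row is omitted because its baseline is not reported.
ROWS = [
    ("Snopes (2024 data)", 0.800, 0.980, 3),
    ("Generated Dataset (Ours)", 82.4, 64.9, 1),
    ("Generated Dataset (Ours)", 58.5, 48.8, 1),
    ("Generated Dataset (Ours)", 81.3, 67.4, 1),
    ("Generated Dataset (Ours)", 71.4, 64.9, 1),
]


def delta(baseline: float, this_paper: float, decimals: int) -> float:
    """Difference 'This Paper' minus 'Baseline', rounded to the row's precision."""
    return round(this_paper - baseline, decimals)


DELTAS = [delta(b, p, d) for _, b, p, d in ROWS]

for (name, b, p, _), dlt in zip(ROWS, DELTAS):
    print(f"{name}: {b} -> {p} (delta {dlt:+})")
```

A negative Δ on the generated dataset means the detector's AUC-ROC dropped, which is the direction the attack is meant to push it.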
Main Takeaways
- Retrieval-free LLM detectors are highly vulnerable to adversarial attacks on unseen news, performing near random guessing.
- Providing RAG-based rationales as feedback allows the generator to learn 'semantic traps' that exploit specific weaknesses in how detectors process retrieved evidence.
- Chain-of-Thought (CoT) reasoning only improves detection performance when combined with RAG; without external knowledge, reasoning does not help.
- The dataset created via this pipeline is significantly harder than previous neural fake news datasets (e.g., those from Chen & Shu 2024), reducing GPT-4o performance from ~85% to ~49%.
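The feedback mechanism described in the takeaways (the detector's rationale is passed back to the generator, which revises the article) can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the paper's pipeline: `Verdict`, `iterative_attack`, `toy_detect`, `toy_rewrite`, the keyword list, the threshold, and the round limit are all hypothetical stand-ins, and a real setup would replace the toy functions with a RAG-based detector and an LLM generator.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    fake_score: float  # detector's confidence that the article is fake, in [0, 1]
    rationale: str     # explanation returned by the detector


def iterative_attack(
    article: str,
    detect: Callable[[str], Verdict],
    rewrite: Callable[[str, str], str],
    max_rounds: int = 3,
    threshold: float = 0.5,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Revise the article using the detector's rationale until it scores
    below the threshold or the round budget is exhausted."""
    verdict = detect(article)
    history = [(article, verdict.fake_score)]
    rounds = 0
    while verdict.fake_score >= threshold and rounds < max_rounds:
        article = rewrite(article, verdict.rationale)
        verdict = detect(article)
        history.append((article, verdict.fake_score))
        rounds += 1
    return article, history


# Toy stand-ins (hypothetical): a keyword detector and a word-dropping rewriter.
SUSPICIOUS = ("shocking", "miracle", "secret")


def toy_detect(article: str) -> Verdict:
    hits = [w for w in SUSPICIOUS if w in article.lower()]
    return Verdict(fake_score=len(hits) / len(SUSPICIOUS), rationale=", ".join(hits))


def toy_rewrite(article: str, rationale: str) -> str:
    flagged = rationale.split(", ")[0]  # drop the first word the detector cited
    return " ".join(w for w in article.split() if w.lower().strip(".,!") != flagged)


final_article, history = iterative_attack("Shocking secret cure found", toy_detect, toy_rewrite)
for text, score in history:
    print(f"{score:.2f}  {text}")
```

The point of the sketch is the control flow: each revision is conditioned on the rationale from the previous detection round, which is how feedback could expose specific weaknesses for the generator to exploit.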