NeoQA: Evidence-based Question Answering with Generated News Events

📝 Paper Summary

Benchmark datasets, Factuality, Modularized RAG pipeline

NEOQA is a benchmark of fictional news timelines and questions designed to force LLMs to reason solely from retrieved evidence rather than parametric memory, exposing reliance on shortcuts.

Core Problem

RAG benchmarks quickly become stale as newer LLMs internalize the test data's facts during pre-training, making it impossible to distinguish genuine retrieval-based reasoning from simple memory recall.

Why it matters:

Trustworthy RAG systems must ground answers in provided documents, not hallucinate from outdated or irrelevant internal knowledge
Existing benchmarks fail to penalize 'shortcut reasoning' where models guess answers without sufficient evidence because the facts are known to them
Current evaluation methods cannot accurately measure a model's ability to deflect (refuse to answer) when evidence is genuinely insufficient

Concrete Example: In RealTimeQA, GPT-4 Turbo answers older questions correctly without any documents because it memorized the news. In NEOQA, answering 'What did Selvia Renek question about the certification program?' requires combining two fictional documents; if one is missing, the model must deflect, but models often hallucinate an answer anyway.

Key Novelty

Fictional World Generation for Clean RAG Evaluation

Generates entirely fictional timelines of events and named entities using GPT-4o to ensure no pre-training data contamination exists
Constructs 'parallel worlds' that follow physical laws but contain unique facts, forcing models to rely exclusively on provided context
Automatically generates pairs of questions and evidence sets that are explicitly sufficient, insufficient, or misleading to test precise reasoning boundaries

Architecture

The iterative process for generating the fictional timeline.

Evaluation Highlights

Models struggle significantly with 'insufficient evidence' scenarios: Phi3-mini achieves only 3.1% accuracy on multi-hop questions where bridge entities are missing, failing to deflect.
Larger models (Qwen2.5-32B) perform better overall (ADTScore 53.2) but still fail to detect subtle false premises, scoring only 26.7% on 'False Premise' questions.
Performance drops sharply as irrelevant documents increase: accuracy on answerable questions falls by ~20% for smaller models when 20 distractor documents are added.

Breakthrough Assessment

8/10

A highly necessary methodological shift for RAG evaluation. By effectively 'sanitizing' the knowledge cutoff problem via fiction, it offers a more rigorous test of reasoning than real-world datasets.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot Multiple-Choice Question Answering with provided evidence documents (RAG setting)

Inputs: A question q, a set of evidence documents D (news articles), and 7 candidate answers (1 correct, 1 deflection, 5 distractors)

Outputs: Selection of the correct answer or the 'unanswerable' deflection option
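The input/output contract above can be sketched as a small data structure. This is a minimal illustration, not the benchmark's actual schema: the class name `Instance`, its field names, and the wording of `DEFLECTION` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical wording; the benchmark's actual deflection option text may differ.
DEFLECTION = "Unanswerable based on the provided documents"


@dataclass(frozen=True)
class Instance:
    """One multiple-choice item: a question, its evidence set, and 7 candidates."""
    question: str
    documents: Tuple[str, ...]   # evidence set D (news articles)
    options: Tuple[str, ...]     # 1 correct + 1 deflection + 5 distractors
    correct_answer: str          # the true answer in the fictional world
    evidence_sufficient: bool    # whether D contains everything needed

    def __post_init__(self) -> None:
        if len(self.options) != 7:
            raise ValueError("expected exactly 7 candidate answers")
        if DEFLECTION not in self.options:
            raise ValueError("the deflection option must be among the candidates")
        if self.correct_answer not in self.options:
            raise ValueError("the correct answer must be among the candidates")

    @property
    def gold(self) -> str:
        # With insufficient evidence, the expected selection is the deflection
        # option, even though a true answer exists in the fictional world.
        return self.correct_answer if self.evidence_sufficient else DEFLECTION


def is_correct(instance: Instance, prediction: str) -> bool:
    """Score a single prediction against the evidence-dependent gold label."""
    return prediction == instance.gold
```

The key design point is that the gold label depends on the evidence set, not only on the question: the same question is scored differently once a required document is withheld.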

Pipeline Flow

Event Generation (Sequential generation of 10 fictional events per timeline)
Article Generation (News articles with different styles generated from event outlines)
QA Generation (Multi-hop, time-span, false premise, and uncertain specificity questions generated from outline items)
Instance Construction (Pairing questions with specific subsets of articles to create Sufficient/Insufficient evidence scenarios)
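The first stage of this flow, sequential event generation, can be sketched as a loop in which each new event is conditioned on the history so far and on an entity knowledge base that is updated after every step. This is only a rough sketch: the event layout, the `generate_event` callable, and `stub_generator` are hypothetical stand-ins for the GPT-4o prompting used in the paper, whose details are not given here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical event layout: {"summary": str, "entities": {name: description}}
Event = Dict[str, object]
KnowledgeBase = Dict[str, str]


def build_timeline(
    generate_event: Callable[[List[Event], KnowledgeBase], Event],
    num_events: int = 10,
) -> Tuple[List[Event], KnowledgeBase]:
    """Generate events one at a time, carrying history and entities forward."""
    timeline: List[Event] = []
    kb: KnowledgeBase = {}
    for _ in range(num_events):
        # Pass copies so the generator cannot mutate the accumulated state.
        event = generate_event(list(timeline), dict(kb))
        timeline.append(event)
        # New or updated entities become visible to all later events.
        kb.update(event.get("entities", {}))
    return timeline, kb


def stub_generator(history: List[Event], kb: KnowledgeBase) -> Event:
    """Placeholder for an LLM call; returns a deterministic dummy event."""
    i = len(history) + 1
    return {
        "summary": f"Event {i}",
        "entities": {f"Entity {i}": f"introduced in event {i}"},
    }


timeline, kb = build_timeline(stub_generator)
```

Swapping `stub_generator` for a real model call would keep the same loop; the point of the sketch is only the sequential conditioning and the knowledge-base update.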

System Modules

Timeline Generator (Data Generation)

Create consistent, sequential fictional events and update the knowledge base of entities

Model or implementation: GPT-4o

Question Generator (Data Generation)

Generate questions requiring multi-hop or temporal reasoning over specific outline items

Model or implementation: GPT-4o

Novel Architectural Elements

Independent Fictional Timelines: A data structure where events are sequentially generated to be internally consistent but completely disjoint from real-world history to neutralize parametric knowledge
Evidence-Controlled Instance Generation: Automatically mapping questions to specific 'atomic' facts allows programmatic creation of 'insufficient evidence' samples by removing the exact document containing a required fact
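The evidence-controlled construction described above can be illustrated with a short sketch. It assumes each article is annotated with the set of atomic fact ids it states; the identifiers, the `doc_facts` mapping, and both function names are hypothetical. Where the text speaks of removing the exact document containing a required fact, this sketch removes every document that states it, which is the same thing when each fact appears in a single article.

```python
from typing import Dict, List, Set


def covers(doc_ids: List[str], doc_facts: Dict[str, Set[str]], required: Set[str]) -> bool:
    """True if the documents together state every required atomic fact."""
    stated: Set[str] = set()
    for doc_id in doc_ids:
        stated |= doc_facts[doc_id]
    return required <= stated


def insufficient_variants(
    doc_ids: List[str], doc_facts: Dict[str, Set[str]], required: Set[str]
) -> Dict[str, List[str]]:
    """For each required fact, build an evidence set with that fact withheld."""
    variants: Dict[str, List[str]] = {}
    for fact in sorted(required):
        # Drop every document that states the fact; keep all the others.
        variants[fact] = [d for d in doc_ids if fact not in doc_facts[d]]
    return variants


# Illustrative data: two articles carry the required facts, one is a distractor.
doc_facts = {"article_1": {"f1"}, "article_2": {"f2"}, "article_3": {"f9"}}
required = {"f1", "f2"}
all_docs = ["article_1", "article_2", "article_3"]

variants = insufficient_variants(all_docs, doc_facts, required)
```

Here `covers(all_docs, doc_facts, required)` holds, while each entry of `variants` fails the same check, so the expected answer for those instances becomes the deflection option.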

Modeling

Base Model: No model is trained; the evaluated models are Qwen2.5 (7B, 14B, 32B), Phi3 (mini, small, medium), and Phi3.5 MoE

Training Method: Zero-shot evaluation with specific prompting strategies

Adaptation: None (Zero-shot inference)

Trainable Parameters: None (Inference only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. RealTimeQA: NEOQA uses entirely fictional events, whereas RealTimeQA becomes solvable via parametric memory as models are updated
vs. Michelangelo: NEOQA features recurring entities and complex multi-event timelines rather than isolated context chunks
vs. HotpotQA: NEOQA explicitly constructs 'insufficient evidence' cases where the answer exists in the 'world' but is missing from the 'context', penalizing shortcut guessing
vs. Situational QA [not cited in paper]: Situational QA tests temporal updates to facts, but NEOQA creates a separate fictional universe to avoid all prior knowledge interference

Limitations

Generated timelines may reflect social biases of the generator model (GPT-4o)
Limited to English language
No human evaluation of 'naturalness' of the fictional news articles
Cannot be used for fine-tuning without risk of overfitting to the generation patterns

Reproducibility

Code: https://github.com/amazon-science/neoqa

The dataset and code are publicly available at https://github.com/amazon-science/neoqa. Data generation uses GPT-4o. Evaluation prompts are selected on a development set of 3 timelines.

📊 Experiments & Results

Evaluation Setup

Zero-shot multiple-choice QA with provided context documents (up to 120 documents)

Benchmarks:

NEOQA (Evidence-based QA with fictional data) [New]

Metrics:

ADTScore (Answer Deflection Tradeoff Score)
Accuracy (answerable)
Accuracy (unanswerable/deflection)
Statistical methodology: Reported phi coefficient for correlation analysis between accuracy types
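Since ADTScore is described as a harmonic mean of accuracy on answerable questions and accuracy on unanswerable questions, it can be sketched in a few lines. The function name `adt_score` and the use of fractions in [0, 1] are assumptions; how the paper aggregates accuracies before combining them is not specified here.

```python
def adt_score(acc_answerable: float, acc_unanswerable: float) -> float:
    """Harmonic mean of the two accuracies, each given as a fraction in [0, 1]."""
    for value in (acc_answerable, acc_unanswerable):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must lie in [0, 1]")
    total = acc_answerable + acc_unanswerable
    if total == 0.0:
        # Both accuracies are zero; define the score as zero to avoid 0/0.
        return 0.0
    return 2.0 * acc_answerable * acc_unanswerable / total


# Illustrative values only, not results from the paper.
balanced = adt_score(0.5, 0.5)
never_deflects = adt_score(0.9, 0.0)
```

Because the harmonic mean is pulled toward the smaller value, a model that always answers (or always deflects) scores zero no matter how strong it is on the other side of the tradeoff.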

Key Results

Performance on answerable vs. unanswerable questions shows models struggle to deflect when they should.

Benchmark	Metric	Baseline	This Paper	Δ
NEOQA	ADTScore	14.3	53.2	+38.9
NEOQA	ADTScore	12.9	53.2	+40.3

Detailed breakdown by question type for the best model (Qwen2.5 32B) reveals specific weaknesses in false premise detection.

Benchmark	Metric	Baseline	This Paper	Δ
NEOQA	Accuracy (Answerable Multi-hop)	14.3	79.4	+65.1
NEOQA	Accuracy (Unanswerable False Premise)	14.3	26.7	+12.4
NEOQA	Accuracy (Unanswerable Uncertain Specificity)	14.3	38.6	+24.3
NEOQA	Accuracy (Unanswerable Multi-hop)	14.3	41.7	+27.4

Experiment Figures

Deflection rates for multi-hop questions across models when different pieces of evidence are missing.

Impact of irrelevant documents on ADTScore and accuracy.

Main Takeaways

LLMs exhibit severe 'shortcut reasoning': when a bridge entity is missing in multi-hop questions, models frequently hallucinate the answer (69.7%-90.7% of errors) rather than deflecting.
Performance is negatively correlated between sufficient and insufficient evidence settings: models that are more eager to answer correctly often fail to refuse when they should.
Chain-of-Thought (CoT) prompting helps smaller models (Phi3 family) deflect more often but can degrade performance on answerable multi-hop questions.
Adding irrelevant documents consistently degrades performance, with accuracy dropping steeply within the first 20 irrelevant documents added.
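The negative correlation noted in the takeaways is reported with a phi coefficient, which measures association between two binary variables. A minimal sketch follows, assuming the two variables are per-question correctness flags under sufficient and insufficient evidence; that pairing, and the function name, are assumptions about how the analysis is set up.

```python
import math
from typing import Sequence


def phi_coefficient(x: Sequence[bool], y: Sequence[bool]) -> float:
    """Phi coefficient of two paired binary sequences (2x2 contingency table)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Undefined when either variable is constant across all pairs.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


# Illustrative flags only: correct with full evidence, wrong (no deflection) without.
correct_when_sufficient = [True, True, False, False]
correct_when_insufficient = [False, False, True, True]
phi = phi_coefficient(correct_when_sufficient, correct_when_insufficient)
```

A value near -1 would mean that questions answered correctly with full evidence tend to be the ones where the model fails to deflect once evidence is removed.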

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with LLM evaluation benchmarks
Concept of 'parametric knowledge' vs. 'contextual knowledge'

Key Terms

Parametric knowledge: Information stored in the model's weights during pre-training

Shortcut reasoning: When a model guesses an answer using heuristics or partial information instead of fully reasoning through the required evidence

Bridge entity: An entity that connects two separate pieces of information necessary to answer a multi-hop question

ADTScore: Answer Deflection Tradeoff Score, the harmonic mean of accuracy on answerable questions and accuracy on unanswerable (deflection) questions

Deflection: The model's ability to refuse to answer ('I don't know') when evidence is insufficient or the question contains a false premise

False premise question: A question containing an assumption that contradicts the provided evidence

Uncertain specificity question: A question asking for details too specific to be found in the provided evidence

CoT: Chain-of-Thought, a prompting strategy where the model generates intermediate reasoning steps before the final answer