DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management

📝 Paper Summary

Disaster Response QA, Domain-Specific Benchmarking

DisastQA introduces a large-scale, human-verified disaster management benchmark that evaluates LLMs' reasoning under retrieval noise and measures factual completeness using a novel keypoint-based protocol.

Core Problem

Existing QA benchmarks focus on general knowledge or clean evidence, failing to capture the fragmented, noisy, and high-stakes nature of information in disaster management.

Why it matters:

Disaster response requires synthesizing information from noisy, incomplete sources (social media, bulletins), which standard clean-context benchmarks do not simulate
Current benchmarks prioritize multiple-choice accuracy or surface-level lexical overlap (ROUGE), failing to measure the factual completeness essential for decision-critical advice
Reliability gaps in high-stakes scenarios remain unmeasured as models are rarely tested against realistic retrieval noise or specific disaster constraints

Concrete Example: A decision-maker needs a complete list of evacuation routes and shelter locations. A standard model might provide a fluent but incomplete answer missing a crucial closed bridge. DisastQA's keypoint metric penalizes this missing fact, whereas ROUGE might score it highly for lexical overlap.
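The contrast in this example can be made concrete with a small sketch of a lexical-overlap score. The code below is a simplified ROUGE-L (longest common subsequence over lowercased word tokens, plain F1), not the official implementation, and the reference and candidate sentences are invented for illustration. The candidate omits the closed bridge entirely, yet it is a perfect subsequence of the reference, so its overlap score stays high.

```python
import re
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercased alphanumeric word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    ref, cand = tokens(reference), tokens(candidate)
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example: the candidate drops the one safety-critical fact.
reference = "Route 9 is open. The river bridge is closed. A shelter is at the high school."
candidate = "Route 9 is open. A shelter is at the high school."
print(round(rouge_l_f1(reference, candidate), 3))
```

Every word of the candidate appears in order in the reference, so precision is 1.0 and the F1 stays above 0.8 even though one of the three facts a responder needs is absent.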

Key Novelty

Tri-Level Evidence Evaluation & Keypoint-Based Completeness

Constructs a benchmark using a Human-LLM collaboration pipeline where LLMs generate drafts from real disaster queries and humans rigorously verify facts and refine distractors
Evaluates models in three distinct contexts (Base, Golden, Mix) to disentangle internal knowledge limits from the ability to reason over noisy, retrieved evidence
Introduces 'Keypoint Coverage' for open-ended QA, a metric that decomposes reference answers into atomic facts to measure strict factual recall rather than n-gram similarity

Architecture

The Human-LLM collaboration pipeline for constructing DisastQA.

Evaluation Highlights

Frontier models like GPT-4o achieve high accuracy in clean settings but degrade when exposed to retrieval noise: in the reported MCQ results, accuracy falls from 94.6 (Golden setting) to 90.2 (Mix setting)
Open-weight models like Qwen-2.5-72B-Instruct now approach proprietary leaders in clean contexts, narrowing the capability gap
Even top models fail to achieve perfect Keypoint Coverage in open-ended tasks, revealing persistent gaps in factual completeness despite high fluency

Breakthrough Assessment

8/10

Significant contribution to domain-specific safety evaluation. The rigorous human verification and keypoint protocol offer a much-needed alternative to lexical metrics for high-stakes QA.

⚙️ Technical Details

Problem Definition

Setting: Question Answering in disaster contexts involving both Multiple-Choice (discriminative) and Open-Ended (generative) tasks under varying evidence conditions

Inputs: Natural language question q, optional context passage p (which may be clean, noisy, or absent)

Outputs: Selected option (for MCQ) or free-form text response (for Open-Ended)

Pipeline Flow

Query Selection (from DisastIR corpus)
Question Rewriting (LLM rewrites keyword query to question)
Answer Generation (LLM generates correct answer + distractors)
Human Refinement (Experts verify facts, refine distractors, decompose answers into keypoints)
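The four stages above can be sketched as a simple function composition. This is a minimal illustration only: the `DraftItem` fields, the `build_item` function, and the three callables are hypothetical names, and the LLM and human steps are passed in as plain functions rather than real model or annotation calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DraftItem:
    """One benchmark item as it moves through the stages (hypothetical schema)."""
    query: str                                        # keyword query from the source corpus
    question: str = ""                                # natural-language question after rewriting
    answer: str = ""                                  # draft correct answer
    distractors: List[str] = field(default_factory=list)
    keypoints: List[str] = field(default_factory=list)
    verified: bool = False                            # set only by the human refinement stage


def build_item(
    query: str,
    rewrite: Callable[[str], str],
    generate: Callable[[str], Tuple[str, List[str]]],
    refine: Callable[[DraftItem], DraftItem],
) -> DraftItem:
    item = DraftItem(query=query)                             # 1. query selection
    item.question = rewrite(item.query)                       # 2. question rewriting
    item.answer, item.distractors = generate(item.question)   # 3. answer generation
    return refine(item)                                       # 4. human refinement


def demo_refine(item: DraftItem) -> DraftItem:
    """Stand-in for expert review: marks the item verified and adds keypoints."""
    item.keypoints = [item.answer]
    item.verified = True
    return item


item = build_item(
    "flood evacuation shelter",
    rewrite=lambda q: f"What should residents know about {q}?",
    generate=lambda q: ("Go to the designated shelter.",
                        ["Stay home.", "Drive through water.", "Wait until dark."]),
    refine=demo_refine,
)
print(item.question, item.verified)
```

Keeping `verified` false until the final stage makes it hard to release an item that skipped human review by accident.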

System Modules

Query Rewriter (Data Construction)

Convert keyword-based search queries into well-formed natural language questions

Model or implementation: LLM (guided by human-written prompts)

Answer Generator (MCQ) (Data Construction)

Create one correct answer and three plausible distractors

Model or implementation: LLM (guided by human-written prompts)

Human Refiner

Verify factual accuracy, refine distractors for difficulty, and decompose open-ended answers

Model or implementation: Human Experts

Novel Architectural Elements

Tri-level evaluation framework (Base, Golden, Mix) integrated directly into the benchmark design to separate reasoning over evidence from parametric knowledge
Human-Verified Keypoint Protocol for Open-Ended QA evaluation, replacing n-gram metrics with atomic fact recall
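As a rough sketch of how atomic fact recall could be computed: coverage is the fraction of reference keypoints judged present in the model's response. The summary does not say how semantic presence is decided, so the matcher here is an assumption. `naive_match` is a deliberately crude normalized-substring placeholder, and `is_covered` is a hypothetical hook where a human or model-based judge would be plugged in.

```python
import re
from typing import Callable, Iterable


def _normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so matching is less brittle."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def naive_match(keypoint: str, response: str) -> bool:
    """Placeholder matcher: normalized substring test, NOT semantic matching."""
    return _normalize(keypoint) in _normalize(response)


def keypoint_coverage(
    keypoints: Iterable[str],
    response: str,
    is_covered: Callable[[str, str], bool] = naive_match,
) -> float:
    """Fraction of reference keypoints judged present in the response."""
    keypoints = list(keypoints)
    if not keypoints:
        raise ValueError("reference answer has no keypoints")
    hits = sum(1 for kp in keypoints if is_covered(kp, response))
    return hits / len(keypoints)


# Invented example: the response omits the closed bridge.
keypoints = [
    "Route 9 is open",
    "the river bridge is closed",
    "shelter at the high school",
]
response = "Route 9 is open, and there is a shelter at the high school."
print(keypoint_coverage(keypoints, response))  # 2 of 3 keypoints matched
```

Because the score is a recall over facts rather than over words, a fluent answer that drops one keypoint loses exactly that keypoint's share of the score.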

Modeling

Base Model: Evaluation covers 20 models including GPT-4o, GPT-5.2 (reported in paper as frontier), Gemini-1.5-Pro, Qwen-2.5-72B-Instruct, Llama-3-70B-Instruct

Comparison to Prior Work

vs. MMLU-Pro: DisastQA includes noisy evidence integration (Mix) and open-ended keypoint evaluation, targeting domain-specific reliability rather than general knowledge
vs. DisasterQA: DisastQA is large-scale (3k items), includes open-ended tasks, and rigorously evaluates reasoning under retrieval noise
vs. SQuAD [not cited in paper]: DisastQA focuses on multi-aspect synthesis from noisy contexts rather than span extraction from clean single documents

Limitations

Keypoint annotation is manual and labor-intensive, limiting the fully annotated open-ended subset to 200 items
Evaluation focuses on English-language disaster resources, potentially limiting global applicability
The 'Mix' setting uses a fixed number of distractors (k=4), which may not fully capture the scale of noise in real-time social media streams

Reproducibility

Code: https://github.com/Disaster-NLP/DisastQA

All code, data, and evaluation resources are available at the project page (https://github.com/Disaster-NLP/DisastQA). The benchmark includes 3,000 verified questions (2,000 MCQ, 1,000 OE) and the keypoint annotations for the OE subset.

📊 Experiments & Results

Evaluation Setup

Zero-shot QA under three evidence conditions: Base (no context), Golden (perfect context), and Mix (1 gold + 4 distractors)
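The three evidence conditions can be sketched as a small context builder. The function names, the prompt template, and the decision to shuffle the gold passage among the distractors are assumptions made for illustration; only the conditions themselves (no context, gold only, 1 gold + 4 distractors) come from the setup described above.

```python
import random
from typing import List, Optional, Sequence


def build_context(
    setting: str,
    gold: str,
    distractors: Sequence[str],
    k: int = 4,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the passages shown to the model under one evidence condition."""
    if setting == "base":
        return []                          # closed-book: no passages
    if setting == "golden":
        return [gold]                      # oracle: only the gold passage
    if setting == "mix":
        if len(distractors) < k:
            raise ValueError(f"need at least {k} distractor passages")
        rng = rng or random.Random()
        passages = [gold] + rng.sample(list(distractors), k)
        rng.shuffle(passages)              # assumption: gold position is randomized
        return passages
    raise ValueError(f"unknown setting: {setting!r}")


def format_prompt(question: str, passages: Sequence[str]) -> str:
    """Hypothetical zero-shot template; the paper's exact wording is not given here."""
    if not passages:
        return f"Question: {question}\nAnswer:"
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"


pool = [f"unrelated passage {i}" for i in range(10)]
mix = build_context("mix", "gold passage", pool, rng=random.Random(0))
print(format_prompt("Which bridge is closed?", mix))
```

Running the same question through all three settings is what lets a drop from Golden to Mix be attributed to noise handling rather than to missing knowledge.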

Benchmarks:

DisastQA-MCQ (Multiple-choice Question Answering) [New]
DisastQA-OE (Open-ended Question Answering) [New]
MMLU-Pro (General domain reasoning)

Metrics:

Exact Match Accuracy (for MCQ)
Keypoint Coverage (for OE)
ROUGE-L
BLEU-4
BERTScore-F1
Statistical methodology: Not explicitly reported in the paper
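For the MCQ metric, a minimal sketch of exact match accuracy follows. It assumes predictions and gold answers are option letters and that comparison ignores case and surrounding whitespace; how the paper extracts the selected option from raw model output is not described here. The result is expressed as a percentage to match the scale of the reported accuracies.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of items whose predicted option letter equals the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    return 100.0 * correct / len(gold)


# Invented example: three of four options match.
print(exact_match_accuracy(["A", "c", "B", " D "], ["A", "C", "D", "D"]))
```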

Key Results

Performance of frontier models on DisastQA-MCQ across different evidence settings, showing the impact of noise.

Benchmark	Metric	Baseline	This Paper	Δ
DisastQA-MCQ	Accuracy (Golden Setting)	83.2	94.6	+11.4
DisastQA-MCQ	Accuracy (Mix Setting)	94.6	90.2	-4.4
DisastQA-MCQ	Accuracy (Mix Setting)	63.8	90.2	+26.4

Keypoint Coverage results for Open-Ended QA, highlighting the difficulty of factual completeness.

Benchmark	Metric	Baseline	This Paper	Δ
DisastQA-OE	Keypoint Coverage (Golden Setting)	0.681	0.825	+0.144
DisastQA-OE	Keypoint Coverage (Mix Setting)	0.825	0.793	-0.032
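The Δ column reads as "This Paper" minus "Baseline", with the baseline differing from row to row. The short check below recomputes each difference from the values in the table; the `rows` list and the `delta` helper are illustrative names, and the rounding to three decimals is an assumption chosen to match the table's precision.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table above
rows = [
    ("DisastQA-MCQ", "Accuracy (Golden Setting)", 83.2, 94.6, 11.4),
    ("DisastQA-MCQ", "Accuracy (Mix Setting)", 94.6, 90.2, -4.4),
    ("DisastQA-MCQ", "Accuracy (Mix Setting)", 63.8, 90.2, 26.4),
    ("DisastQA-OE", "Keypoint Coverage (Golden Setting)", 0.681, 0.825, 0.144),
    ("DisastQA-OE", "Keypoint Coverage (Mix Setting)", 0.825, 0.793, -0.032),
]


def delta(baseline: float, value: float, ndigits: int = 3) -> float:
    """Difference 'value - baseline', rounded to absorb floating-point noise."""
    return round(value - baseline, ndigits)


for bench, metric, baseline, value, reported in rows:
    computed = delta(baseline, value)
    # Fail loudly if a reported delta disagrees with the two values next to it.
    assert abs(computed - reported) < 1e-9, (bench, metric, computed, reported)
    print(f"{bench:13s} {metric:36s} {computed:+.3f}")
```

All five reported differences agree with their row values.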

Experiment Figures

Distribution of keypoint counts per question in DisastQA-OE.

Main Takeaways

Open-weight models like Qwen-2.5-72B are closing the gap with proprietary models in clean (Golden) settings but still lag in noisy (Mix) settings.
Reasoning gaps persist: No model achieves perfect Keypoint Coverage, indicating that even frontier models struggle to synthesize all necessary facts for comprehensive disaster response.
Model performance degrades sharply under retrieval noise (Mix setting), confirming that standard clean benchmarks overestimate real-world reliability.
Keypoint Coverage reveals deficiencies in factual completeness that surface-level metrics like ROUGE fail to capture.

📚 Prerequisite Knowledge

Prerequisites

Question Answering (QA) evaluation metrics
RAG (Retrieval-Augmented Generation) concepts
Basic understanding of LLM hallucination and grounding

Key Terms

Keypoint Coverage: A metric measuring the proportion of atomic facts (keypoints) from a gold reference answer that are semantically present in the model's generated response

Mix Setting: An evaluation context where the model receives the correct passage mixed with plausible but irrelevant distractor passages, simulating noisy retrieval

Golden Setting: An oracle evaluation context where the model receives only the correct, ground-truth passage

Base Setting: A closed-book evaluation context where the model answers using only its internal parametric knowledge without external evidence

FActScore: A fine-grained atomic evaluation metric that decomposes text into atomic facts for verification; DisastQA's keypoint approach aligns with this concept

Parametric Knowledge: Information stored within the model's weights during pre-training, as opposed to information provided in the input context

Distractor: In MCQ, an incorrect option; in retrieval contexts, an irrelevant passage included to test the model's ability to filter noise