Evaluation Setup
Zero-shot multiple-choice QA with provided context documents (up to 120 documents)
Benchmarks:
- NEOQA (Evidence-based QA with fictional data) [New]
Metrics:
- ADTScore (Answer Deflection Tradeoff Score)
- Accuracy (answerable)
- Accuracy (unanswerable/deflection)
- Statistical methodology: the phi coefficient is reported to measure the correlation between the two accuracy types (answerable vs. unanswerable)
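As a rough illustration of the two accuracy metrics and the phi coefficient, the sketch below scores a list of (gold, predicted) pairs and computes phi from a 2x2 contingency table. The `DEFLECT` label, the record layout, and both function names are hypothetical, and how the paper pairs outcomes for the correlation analysis is not stated here. The ADTScore formula is not given in this summary, so it is not reproduced.

```python
import math

# Hypothetical label for the "cannot be answered from the documents" option.
DEFLECT = "DEFLECT"


def split_accuracy(records):
    """Accuracy on answerable and unanswerable items.

    records: iterable of (gold, predicted) pairs. An item is treated as
    unanswerable when its gold label is DEFLECT (an assumption of this sketch).
    """
    correct = {"answerable": 0, "unanswerable": 0}
    total = {"answerable": 0, "unanswerable": 0}
    for gold, pred in records:
        key = "unanswerable" if gold == DEFLECT else "answerable"
        total[key] += 1
        correct[key] += int(pred == gold)
    return {k: (correct[k] / total[k] if total[k] else float("nan")) for k in total}


def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient of a 2x2 contingency table of two binary outcomes.

    n11: both outcomes true, n10: only the first, n01: only the second,
    n00: neither. Returns NaN when any marginal total is zero.
    """
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom
```

A negative phi would indicate that being correct on one outcome tends to coincide with being wrong on the other.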
Key Results
Performance on answerable vs. unanswerable questions shows models struggle to deflect when they should.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| NEOQA | ADTScore | 14.3 | 53.2 | +38.9 |
| NEOQA | ADTScore | 12.9 | 53.2 | +40.3 |

Detailed breakdown by question type for the best model (Qwen2.5 32B) reveals specific weaknesses in false premise detection.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| NEOQA | Accuracy (Answerable Multi-hop) | 14.3 | 79.4 | +65.1 |
| NEOQA | Accuracy (Unanswerable False Premise) | 14.3 | 26.7 | +12.4 |
| NEOQA | Accuracy (Unanswerable Uncertain Specificity) | 14.3 | 38.6 | +24.3 |
| NEOQA | Accuracy (Unanswerable Multi-hop) | 14.3 | 41.7 | +27.4 |

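The Δ column is the "This Paper" value minus the "Baseline" value. A small check, with the numbers copied from the rows above, confirms that every reported difference is consistent; the helper names are illustrative only.

```python
# (metric, baseline, this_paper, reported_delta), copied from the results above.
ROWS = [
    ("ADTScore", 14.3, 53.2, 38.9),
    ("ADTScore", 12.9, 53.2, 40.3),
    ("Accuracy (Answerable Multi-hop)", 14.3, 79.4, 65.1),
    ("Accuracy (Unanswerable False Premise)", 14.3, 26.7, 12.4),
    ("Accuracy (Unanswerable Uncertain Specificity)", 14.3, 38.6, 24.3),
    ("Accuracy (Unanswerable Multi-hop)", 14.3, 41.7, 27.4),
]


def delta(baseline, score):
    """Difference in points, rounded to one decimal place as in the table."""
    return round(score - baseline, 1)


def mismatches(rows):
    """Metrics whose reported delta differs from score minus baseline."""
    return [metric for metric, base, score, reported in rows
            if delta(base, score) != reported]
```
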
Main Takeaways
- LLMs exhibit severe 'shortcut reasoning': when a bridge entity is missing in multi-hop questions, models frequently hallucinate the answer (69.7%-90.7% of errors) rather than deflecting.
- Performance in the sufficient-evidence and insufficient-evidence settings is negatively correlated: models that do better on answerable questions tend to be worse at deflecting when the evidence is missing.
- Chain-of-Thought (CoT) prompting helps smaller models (Phi3 family) deflect more often but can degrade performance on answerable multi-hop questions.
- Adding irrelevant documents consistently degrades performance, with accuracy dropping steeply within the first 20 irrelevant documents added.
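The irrelevant-document finding suggests an evaluation loop that pads each question's context with a growing number of distractors. The sketch below shows one way such contexts could be assembled; `build_context`, its parameters, and the shuffling step are assumptions made for illustration and are not taken from the paper.

```python
import random


def build_context(relevant_docs, distractor_pool, n_irrelevant, seed=0):
    """Mix the relevant documents with n_irrelevant sampled distractors.

    The result is shuffled so that position does not reveal which documents
    are relevant. A fixed seed keeps the sample reproducible across runs.
    """
    if n_irrelevant > len(distractor_pool):
        raise ValueError("not enough distractor documents in the pool")
    rng = random.Random(seed)
    docs = list(relevant_docs) + rng.sample(list(distractor_pool), n_irrelevant)
    rng.shuffle(docs)
    return docs


if __name__ == "__main__":
    pool = [f"distractor-{i}" for i in range(120)]
    for n in (0, 5, 10, 20):
        context = build_context(["evidence-a", "evidence-b"], pool, n)
        print(n, len(context))
```

Accuracy could then be measured at each distractor count to trace how quickly it falls off.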