Evaluation Setup
A controlled testbed built from MRQA datasets (NQ, HotpotQA, TriviaQA) in which knowledge availability is explicitly manipulated along two axes: parametric knowledge (valid or invalid) and contextual knowledge (valid or invalid).
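The four testbed conditions can be enumerated with a short sketch. The `Condition` class and `expected_behavior` helper are hypothetical names used for illustration only. The rule that a query counts as answerable when at least one knowledge source is valid is an assumption and is not stated in this summary.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Condition:
    """One cell of the 2x2 knowledge-availability grid (hypothetical helper)."""
    parametric_valid: bool  # the model's own (parametric) knowledge is usable
    context_valid: bool     # the supplied context is usable

    @property
    def answerable(self) -> bool:
        # ASSUMPTION: a query is answerable if at least one source is valid.
        return self.parametric_valid or self.context_valid


# All four combinations of Valid/Invalid Parametric x Valid/Invalid Context.
CONDITIONS = [Condition(p, c) for p, c in product([True, False], repeat=2)]


def expected_behavior(cond: Condition) -> str:
    """Return the desired model behavior under the assumption above."""
    return "answer" if cond.answerable else "abstain"


if __name__ == "__main__":
    for cond in CONDITIONS:
        print(cond, "->", expected_behavior(cond))
```

Under this assumption, only the condition with both sources invalid calls for abstention.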
Benchmarks:
- Natural Questions (NQ) (Open-domain QA)
- HotpotQA (Multi-hop QA)
- TriviaQA (Trivia QA)
Metrics:
- EM (Exact Match) for generation accuracy
- AUC (Area Under the Curve) for abstention performance
- Jaccard Similarity
- Statistical methodology: Not explicitly reported in the paper
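As a rough illustration of the three metrics listed above, here is a minimal sketch in plain Python. Several details are assumptions rather than the paper's procedure: the SQuAD-style answer normalization, the reading of AUC as ROC AUC over per-query abstention scores, and the token-set form of Jaccard similarity (this summary does not say what Jaccard is applied to). All function names are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """SQuAD-style normalization (assumed): lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> int:
    """1 if the normalized prediction equals any normalized gold answer, else 0."""
    pred = normalize(prediction)
    return int(any(pred == normalize(gold) for gold in gold_answers))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity: |A & B| / |A | B|."""
    set_a, set_b = set(normalize(a).split()), set(normalize(b).split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def abstention_auc(scores: list[float], should_abstain: list[int]) -> float:
    """ROC AUC (assumed) in its rank form.

    Probability that a randomly chosen unanswerable query (label 1) receives a
    higher abstention score than a randomly chosen answerable one (label 0).
    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, should_abstain) if y == 1]
    neg = [s for s, y in zip(scores, should_abstain) if y == 0]
    if not pos or not neg:
        raise ValueError("both answerable and unanswerable queries are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of queries, which is fine for a sketch; a library routine such as scikit-learn's `roc_auc_score` would be the usual choice at scale.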
Key Results
Abstention performance (AUC) comparisons showing CDA's ability to distinguish answerable from unanswerable queries compared to confidence-based baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Natural Questions | AUC | 73.2 | 83.5 | +10.3 |
| HotpotQA | AUC | 73.7 | 82.4 | +8.7 |
| TriviaQA | AUC | 72.4 | 79.2 | +6.8 |

Generation performance (EM) on the 'Answerable' subset of the testbed, demonstrating that adding abstention capabilities does not degrade standard QA performance.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Natural Questions | EM | 45.0 | 47.7 | +2.7 |
| HotpotQA | EM | 29.9 | 32.0 | +2.1 |

Comparison against training-based abstention methods (R-Tuning) to show the efficacy of the training-free approach.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Natural Questions | AUC | 70.2 | 79.1 | +8.9 |
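The Δ column is the difference between the "This Paper" and "Baseline" scores. A small sketch that recomputes it from the reported numbers is shown here; the `ROWS` layout and the `delta` helper are illustrative names, not part of the paper.

```python
# (benchmark, metric, baseline, this_paper) as reported in the results above.
ROWS = [
    ("Natural Questions", "AUC", 73.2, 83.5),
    ("HotpotQA", "AUC", 73.7, 82.4),
    ("TriviaQA", "AUC", 72.4, 79.2),
    ("Natural Questions", "EM", 45.0, 47.7),
    ("HotpotQA", "EM", 29.9, 32.0),
    ("Natural Questions", "AUC", 70.2, 79.1),  # baseline here is R-Tuning
]


def delta(baseline: float, ours: float) -> float:
    """Absolute improvement in points, rounded to one decimal place."""
    return round(ours - baseline, 1)


DELTAS = [delta(baseline, ours) for _, _, baseline, ours in ROWS]

if __name__ == "__main__":
    for (name, metric, baseline, ours), d in zip(ROWS, DELTAS):
        print(f"{name:18s} {metric:3s} {baseline:5.1f} -> {ours:5.1f}  ({d:+.1f})")
```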
Main Takeaways
- CDA consistently outperforms confidence-based baselines (Logit, Self-Consistency) in abstention tasks across multiple datasets and models.
- The method maintains competitive or superior generation accuracy on answerable queries, indicating that the abstention mechanism does not interfere with valid knowledge retrieval.
- Ablation studies confirm the necessity of bias calibration; without it, entropy estimates are unreliable.
- CDA generalizes well to retrieval-augmented generation (RAG) settings where the retriever may fetch irrelevant documents, effectively filtering them out via the adaptive weighting mechanism.