Evaluation Setup
Multi-hop QA on the 2WikiMultihopQA dataset
Benchmarks:
- 2WikiMultihopQA (Multi-hop reasoning QA)
Metrics:
- F1 score
- Exact Match (EM)
- Number of Retrieval Calls (Efficiency)
Statistical methodology: each experiment was run 3 times and averages are reported; no statistical significance tests are reported.
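
The source names F1 and Exact Match but does not say how they are computed. The sketch below assumes the token-level, SQuAD-style definitions that are common for multi-hop QA (lowercasing, stripping punctuation and English articles); `normalize_answer`, `exact_match`, `f1_score`, and `mean_over_runs` are illustrative names, not functions from the paper.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction)
    gold_tokens = normalize_answer(gold)
    if not pred_tokens or not gold_tokens:
        # Assumption: two empty answers count as a match, one empty answer as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_over_runs(per_run_scores: list[float]) -> float:
    """Average one metric over repeated runs (the source reports 3-run averages)."""
    return sum(per_run_scores) / len(per_run_scores)
```

For example, `f1_score("the Eiffel Tower in Paris", "Eiffel Tower")` has precision 2/4 and recall 2/2, giving an F1 of about 0.667, while `exact_match` for the same pair is 0.0.
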
Key Results
Comparison of different uncertainty triggers against baselines on the smaller seed set (25 examples).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| 2WikiMultihopQA (Small Set) | F1 | 0.538 | 0.605 | +0.067 |
| 2WikiMultihopQA (Small Set) | F1 | 0.538 | 0.411 | -0.127 |

Results on the larger set (75 examples) showing trade-offs between accuracy and retrieval frequency.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| 2WikiMultihopQA (Large Set) | F1 | 0.597 | 0.561 | -0.036 |
| 2WikiMultihopQA (Large Set) | Number of Searches | 291.0 | 153.3 | -137.7 |
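
The Δ column is the reported value minus the baseline. The short sketch below recomputes it from the table values and adds the relative change, which the source does not report; the helper names `delta` and `relative_change` are illustrative.

```python
def delta(baseline: float, reported: float) -> float:
    """Absolute change, rounded to the table's three decimal places."""
    return round(reported - baseline, 3)


def relative_change(baseline: float, reported: float) -> float:
    """Change expressed as a fraction of the baseline."""
    return (reported - baseline) / baseline


# (benchmark subset, metric, baseline, reported) copied from the table above.
rows = [
    ("Small Set", "F1", 0.538, 0.605),
    ("Small Set", "F1", 0.538, 0.411),
    ("Large Set", "F1", 0.597, 0.561),
    ("Large Set", "Number of Searches", 291.0, 153.3),
]

for subset, metric, baseline, reported in rows:
    print(
        f"{subset:9s} {metric:18s} "
        f"delta={delta(baseline, reported):+.3f} "
        f"relative={relative_change(baseline, reported):+.1%}"
    )
```

On the larger set, the drop from 291.0 to 153.3 searches is a reduction of about 47%, against an F1 decrease of 0.036 (about 6% relative).
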
Main Takeaways
- Eccentricity-based uncertainty detection offers the best balance: it improves F1 on the small set and stays competitive on the larger set while roughly halving retrieval calls.
- Lightweight metrics such as Degree Matrix (Jaccard) are effective at minimizing retrieval cost but sacrifice some accuracy compared with Always Retrieve.
- Always Retrieve remains a strong baseline when retrieval cost is not a concern, outperforming the conditional methods in raw F1 on the larger set.
- Complex semantic clustering (Semantic Sets) underperformed in this specific conditional retrieval setup.
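
The source does not spell out how the Degree Matrix (Jaccard) trigger is computed. As a rough illustration of a lightweight uncertainty trigger, the sketch below assumes one common formulation: sample several answers, build a pairwise Jaccard similarity matrix W over their word sets, take the degree matrix D of W, and score uncertainty as trace(mI - D) / m^2 for m samples. The function names and the 0.5 threshold are hypothetical and are not the paper's configuration.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two answers."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # Assumption: two empty answers are treated as identical.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def degree_uncertainty(samples: list[str]) -> float:
    """trace(m*I - D) / m^2, where D holds the row sums of the similarity matrix.

    Returns 0.0 when all sampled answers share the same word set and grows
    toward 1 - 1/m as the samples share fewer words.
    """
    m = len(samples)
    if m == 0:
        raise ValueError("at least one sampled answer is required")
    degrees = [sum(jaccard(s, t) for t in samples) for s in samples]
    return sum(m - d for d in degrees) / (m * m)


def should_retrieve(samples: list[str], threshold: float = 0.5) -> bool:
    """Trigger retrieval only when sampled answers disagree enough.

    The threshold value is a placeholder; it would need tuning on held-out data.
    """
    return degree_uncertainty(samples) > threshold
```

With three identical samples the score is 0.0 and no retrieval is triggered; with three samples that share no words the score is 2/3 and retrieval is triggered. A trigger of this kind skips retrieval calls whenever the model's sampled answers already agree, which is the accuracy-versus-cost trade-off the takeaways describe.
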