Evaluation Setup
Tasks: open-domain QA, fact verification, and dialog generation, all using Wikipedia passages as the retrieved context.
Benchmarks:
- NaturalQuestions (NQ) (Open-Domain QA)
- TriviaQA (TQA) (Open-Domain QA)
- HotpotQA (Multi-hop QA)
- ELI5 (Long-Form QA)
- FEVER (Fact Verification)
- Wizard of Wikipedia (WoW) (Knowledge-Grounded Dialog)
Metrics:
- Exact Match (EM)
- F1 Score
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
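The three metrics above can be sketched as follows. This is a minimal illustration, not the paper's evaluation script: the SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and the function names `normalize`, `exact_match`, `token_f1`, and `accuracy` are assumptions, since the summary does not say how answers were normalized.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Assumed normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def accuracy(predictions, labels):
    """Fraction of predicted labels that equal the gold labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

For example, `exact_match("The Eiffel Tower.", "eiffel tower")` counts as a match under this normalization, while `token_f1("Paris France", "Paris")` gives partial credit because only one of the two predicted tokens appears in the reference.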
Key Results
Performance comparison in the single-passage retrieval setting using Llama-2-7B. FILCO generally outperforms full context (FULL) and passage-level filtering (PSG).
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| NaturalQuestions (NQ) | EM | 34.7 | 43.3 | +8.6 |
| FEVER | Accuracy | 82.3 | 86.6 | +4.3 |
| HotpotQA | F1 | 58.2 | 59.5 | +1.3 |
| TriviaQA (TQA) | EM | 60.5 | 60.7 | +0.2 |

Performance comparison in the multiple-passage (Top-5) setting using Flan-T5-XL. FILCO shows robust gains over standard baselines.
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| FEVER | Accuracy | 88.1 | 91.4 | +3.3 |
| NaturalQuestions (NQ) | EM | 48.3 | 61.8 | +13.5 |
| Wizard of Wikipedia (WoW) | F1 | 64.8 | 66.0 | +1.2 |
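The Δ column is the absolute difference in metric points between the two score columns. A short sketch that recomputes it from the reported values; the tuple layout and the names `ROWS`, `delta`, and `deltas` are illustrative choices, and the "single"/"top5" labels are shorthand for the two settings reported above.

```python
# (setting, benchmark, metric, baseline, this_paper), copied from the results.
ROWS = [
    ("single", "NQ", "EM", 34.7, 43.3),
    ("single", "FEVER", "Accuracy", 82.3, 86.6),
    ("single", "HotpotQA", "F1", 58.2, 59.5),
    ("single", "TQA", "EM", 60.5, 60.7),
    ("top5", "FEVER", "Accuracy", 88.1, 91.4),
    ("top5", "NQ", "EM", 48.3, 61.8),
    ("top5", "WoW", "F1", 64.8, 66.0),
]


def delta(baseline, score):
    """Absolute gain in metric points, rounded to one decimal place."""
    return round(score - baseline, 1)


deltas = {
    (setting, name): delta(baseline, score)
    for setting, name, _metric, baseline, score in ROWS
}

for (setting, name), gain in deltas.items():
    print(f"{setting:6s} {name:9s} {gain:+.1f}")
```

Rounding matters here because the scores are stored as binary floating-point numbers, so the raw subtraction can differ from the tabulated value in the last digits.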
Main Takeaways
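- FILCO outperforms full-context and passage-filtering baselines across extractive QA, abstractive QA, fact verification, and dialog tasks, although the size of the gain varies widely among the reported results (from +0.2 EM on TriviaQA to +13.5 EM on NQ).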
- Different filtering strategies work best for different tasks: STRINC is best for extractive QA, LEXICAL for dialog, and CXMI for complex/abstractive tasks.
- Sentence-level filtering reduces input token count by 44-64%, improving efficiency.
- Filtering improves performance even when the retrieved passages are negative (that is, they do not support the correct output), likely because it removes misleading noise.
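To make the filtering takeaways concrete, here is a rough sketch of sentence-level filtering and the resulting token reduction. The summary names the strategies without defining them, so the interpretations are assumptions: STRINC is taken to mean "keep sentences that contain the answer string" and LEXICAL to mean "keep sentences whose unigram F1 with a reference text clears a threshold". CXMI is omitted because it needs a generator model to score each sentence. The regex sentence splitter, whitespace-style tokenization, the `threshold` value, and all function names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(passage):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", passage)
    return [p.strip() for p in parts if p.strip()]


def tokens(text):
    """Lowercased word tokens (a stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text.lower())


def unigram_f1(a, b):
    """Unigram-overlap F1 between two texts."""
    ta, tb = tokens(a), tokens(b)
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def strinc_filter(passage, answer):
    """Assumed STRINC: keep sentences containing the answer string."""
    return [s for s in split_sentences(passage) if answer.lower() in s.lower()]


def lexical_filter(passage, reference, threshold=0.5):
    """Assumed LEXICAL: keep sentences with enough unigram overlap."""
    return [s for s in split_sentences(passage)
            if unigram_f1(s, reference) >= threshold]


def token_reduction(passage, kept):
    """Fraction of the passage's tokens removed by filtering."""
    before = len(tokens(passage))
    after = sum(len(tokens(s)) for s in kept)
    return 1 - after / before


passage = ("Paris is the capital of France. It hosts the Louvre museum. "
           "The city lies on the Seine.")
kept = strinc_filter(passage, "Louvre")
print(kept, f"{token_reduction(passage, kept):.0%} of tokens removed")
```

This toy passage is far shorter than a real retrieved passage, so the reduction it prints should not be compared with the 44-64% range reported above. The sketch also assumes a reference answer is available when filtering; how filtering is performed when no answer is known is not covered by this summary.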