Evaluation Setup
Open-domain and private-domain question answering (QA) with retrieval. Models must generate long-form answers, with citations, grounded in the provided context.
Benchmarks:
- GaRAGe (Retrieval-Augmented Generation (QA)) [New]
Metrics:
- Relevance-Aware Factuality (RAF)
- Unadjusted Relevance-Aware Factuality (uRAF)
- Eligibility (faithfulness to user intent)
- Deflection True Positive Rate (TPR)
- Citation F1 (Attribution)
- Statistical methodology: Not explicitly reported in the paper
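Citation F1 is listed above without a formula. As a rough illustration only, and assuming it is the usual set-based F1 between the passage IDs a model cites and the IDs annotated as relevant (the paper's exact definition may differ, for example by scoring per sentence), a minimal Python sketch could look like this. The function name `citation_f1` and the ID strings are hypothetical.

```python
from __future__ import annotations


def citation_f1(cited: set[str], relevant: set[str]) -> float:
    """Set-based F1 between cited passage IDs and annotated relevant IDs.

    Assumption: one set of citations per answer; this is an illustrative
    definition, not necessarily the one used in the paper.
    """
    true_positives = len(cited & relevant)
    if true_positives == 0:
        # Covers empty citations, empty annotations, and no overlap.
        return 0.0
    precision = true_positives / len(cited)    # share of citations that are relevant
    recall = true_positives / len(relevant)    # share of relevant passages that are cited
    return 2 * precision * recall / (precision + recall)
```

For instance, citing `{"d1", "d2", "d3"}` when `{"d1", "d2", "d4", "d5"}` are relevant gives precision 2/3, recall 1/2, and F1 = 4/7 ≈ 0.571 under this definition.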
Key Results
Main results comparing model performance on Eligibility (answering the prompt well) and Relevance-Aware Factuality (answering using ONLY relevant docs).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GaRAGe | Eligibility | 90.2 | 92.6 | +2.4 |
| GaRAGe | RAF (Relevance-Aware Factuality) | 57.7 | 60.0 | +2.3 |
| GaRAGe | uRAF (Unadjusted Factuality) | 62.3 | 63.8 | +1.5 |

Deflection experiments measure the ability to refuse to answer when grounding is insufficient (True Positive) vs. refusing when grounding is sufficient (False Positive).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GaRAGe (Deflection Subset) | True Positive Rate (TPR) | 26.9 | 31.1 | +4.2 |
| GaRAGe (Sufficient Subset) | False Positive Rate (FPR) | 0.3 | 1.4 | +1.1 |

Attribution results measure how accurately models cite the relevant sources in their generated answers.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GaRAGe | Citation F1 | 57.3 | 58.9 | +1.6 |
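The deflection results above report rates over two subsets: questions whose grounding is insufficient (where deflecting is correct) and questions whose grounding is sufficient (where deflecting is an error). A minimal sketch of how such rates could be computed, assuming each example carries a binary sufficiency label and a binary flag for whether the model deflected. The `Example` class and its field names are hypothetical, not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Example:
    grounding_sufficient: bool  # hypothetical annotation: can the context answer the question?
    model_deflected: bool       # hypothetical flag: did the model decline to answer?


def deflection_rates(examples: list[Example]) -> tuple[float, float]:
    """Return (TPR, FPR) as fractions in [0, 1].

    TPR: share of insufficient-grounding examples where the model deflected.
    FPR: share of sufficient-grounding examples where the model deflected.
    An empty subset yields NaN for its rate.
    """
    insufficient = [e for e in examples if not e.grounding_sufficient]
    sufficient = [e for e in examples if e.grounding_sufficient]

    def rate(subset: list[Example]) -> float:
        if not subset:
            return float("nan")
        return sum(e.model_deflected for e in subset) / len(subset)

    return rate(insufficient), rate(sufficient)
```

Multiplying the returned fractions by 100 gives percentages on the same scale as the figures reported above. A higher TPR is better, while a higher FPR means more unnecessary refusals.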
Main Takeaways
- Over-summarization issue: Models tend to incorporate information from all retrieved chunks, failing to distinguish between relevant and irrelevant/outdated snippets (evidenced by low RAF scores).
- Deflection failure: Even strong models like GPT-4o correctly refuse to answer only ~30% of the time when grounding is insufficient, which poses a risk for reliable RAG.
- Temporal reasoning gap: Models perform ~10% worse on 'Fast-Changing' questions, suggesting they struggle to identify which document is the most current.
- Domain sensitivity: Performance drops significantly (>10%) on private/specific domains (e.g., Enron emails) compared to general Web search topics.