Evaluation Setup
Open-domain QA using Natural Questions dataset
Benchmarks:
- Natural Questions (NQ) (Open-domain QA)
Metrics:
- Win-rate (identifying gold docs)
- Kendall's tau (correlation with correctness)
- Mean Reciprocal Rank (MRR)
- Answer Accuracy
- Statistical methodology: Sign test used for statistical significance in win-rate comparisons
Key Results
| Benchmark |
Metric |
Baseline |
This Paper |
Δ |
| Evaluation of metric properties: Can GroGU identify gold documents and correlate with actual generation accuracy? |
| Natural Questions |
Win Rate (Gold vs Distractor) |
0.686 |
0.745 |
+0.059
|
| Natural Questions |
Kendall's Tau Correlation |
-0.093 |
0.377 |
+0.470
|
| Downstream application: Using GroGU to train a query rewriter. |
| Natural Questions |
Mean Reciprocal Rank (MRR) |
Not reported in the paper |
Not reported in the paper |
Not reported in the paper
|
| Natural Questions |
Answer Accuracy |
Not reported in the paper |
Not reported in the paper |
Not reported in the paper
|
Main Takeaways
- GroGU (KeyEntropy) effectively distinguishes ground-truth documents from high-ranking distractors, outperforming perplexity-based metrics.
- Relevance scores negatively correlate with generation accuracy in scenarios where random documents help generation (noise robustness), whereas GroGU maintains positive correlation.
- Utility is model-specific: A document layout preferred by one model (e.g., Phi) is not necessarily the best for another (e.g., Qwen).
- Training a query-rewriter using GroGU-derived preferences significantly improves both retrieval ranking and final answer accuracy.