Evaluation Setup
RAG benchmarking using a knowledge base of 609 news articles (Sept-Dec 2023)
Benchmarks:
- MultiHop-RAG (Multi-hop RAG Retrieval and Answering) [New]
Metrics:
- MAP@K (Mean Average Precision)
- MRR@K (Mean Reciprocal Rank)
- Hits@K (Hit Rate)
- Generation Accuracy
- Statistical methodology: Not explicitly reported in the paper
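The retrieval metrics listed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names, the toy document IDs, and the normalization choices are all assumptions. In particular, Hits@K is taken to mean "at least one gold evidence item appears in the top K", and average precision is normalized by min(|relevant|, K); other conventions exist and the paper may use a different one.

```python
from typing import Callable, List, Sequence, Set, Tuple

# A "run" pairs one query's ranked retrieval list with its gold evidence set.
Run = Tuple[Sequence[str], Set[str]]


def hits_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any gold item appears in the top k results, else 0.0 (assumed definition)."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first gold item within the top k, or 0.0 if none is found."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Mean of precision@rank over the ranks where a gold item occurs.

    Assumes ranked contains no duplicates; normalized by min(|relevant|, k).
    """
    if not relevant:
        return 0.0
    found = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            found += 1
            precision_sum += found / rank
    return precision_sum / min(len(relevant), k)


def mean_over_queries(metric: Callable[[Sequence[str], Set[str], int], float],
                      runs: List[Run], k: int) -> float:
    """Average a per-query metric over all queries (the 'M' in MAP and MRR)."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Toy data with hypothetical document IDs.
runs: List[Run] = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),  # gold items at ranks 2 and 4
    (["d5", "d6", "d8", "d9"], {"d4"}),        # gold item never retrieved
]

hits = mean_over_queries(hits_at_k, runs, k=4)
mrr = mean_over_queries(reciprocal_rank_at_k, runs, k=4)
mean_ap = mean_over_queries(average_precision_at_k, runs, k=4)
print(hits, mrr, mean_ap)  # 0.5 0.25 0.25
```

For the first toy query, the gold items sit at ranks 2 and 4, so the reciprocal rank is 1/2 and the average precision is (1/2 + 2/4) / 2 = 0.5. The second query contributes zero to every metric, which halves each mean.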
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Retrieval performance results showing the difficulty of multi-hop retrieval even for top embedding models. |
| MultiHop-RAG | Hits@10 | 0.7059 | 0.7467 | +0.0408 |
| MultiHop-RAG | MRR@10 | 0.5477 | 0.5860 | +0.0383 |
| Generation/Reasoning performance results comparing LLMs when given retrieved context vs. perfect ground-truth context. |
| MultiHop-RAG | Accuracy | 0.28 | 0.56 | +0.28 |
| MultiHop-RAG | Accuracy | 0.36 | 0.89 | +0.53 |
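The Δ column is the reported value minus the baseline. A short sketch that recomputes it from the figures in the table is given here as a sanity check; the `rows` list and the `deltas` helper are illustrative names, and the relative-gain column is an addition for context, not something the paper reports.

```python
# (metric, baseline, reported) copied from the Key Results table.
rows = [
    ("Hits@10", 0.7059, 0.7467),
    ("MRR@10", 0.5477, 0.5860),
    ("Accuracy", 0.28, 0.56),
    ("Accuracy", 0.36, 0.89),
]


def deltas(table):
    """Return (metric, absolute delta, relative gain in percent) for each row."""
    results = []
    for metric, baseline, reported in table:
        absolute = round(reported - baseline, 4)
        relative_pct = round(100 * (reported - baseline) / baseline, 1)
        results.append((metric, absolute, relative_pct))
    return results


computed = deltas(rows)
for metric, absolute, relative_pct in computed:
    print(f"{metric}: {absolute:+.4f} ({relative_pct:+.1f}% relative)")
```

The absolute differences come out to +0.0408, +0.0383, +0.28, and +0.53, matching the table.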
Main Takeaways
- A significant retrieval gap exists: even the best embedding model + reranker finds the necessary evidence only about 75% of the time (Hits@10 = 0.7467)
- Open-source models (Llama-2, Mixtral) struggle heavily with reasoning over multiple documents, even when perfect evidence is provided (max 36% accuracy)
- GPT-4 shows robust reasoning (89% accuracy) given ground truth, suggesting the bottleneck for SOTA models is retrieval, while for open models it is both retrieval and reasoning
- Models perform relatively well on Null queries (detecting unanswerable questions) but perform markedly worse on Comparison and Temporal queries