Evaluation Setup
Factual QA with 3 tasks: Retrieval Summarization, KG/Web Retrieval Augmentation, End-to-end RAG
Benchmarks:
- CRAG (Comprehensive RAG Benchmark): factual QA with retrieval, newly introduced in this paper
Metrics:
- Accuracy (Perfect + Acceptable)
- Truthfulness (score that credits correct answers and penalizes hallucinated ones)
- Hallucination Rate
- Statistical methodology: Not explicitly reported in the paper
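To make the relationship between the three metrics concrete, here is a minimal sketch. Accuracy as the share of perfect plus acceptable answers follows the list above. The per-answer weights (1, 0.5, 0, -1), the label names, and treating every incorrect answer as a hallucination are assumptions made for illustration; this summary does not state them. The labels are toy data, not CRAG results.

```python
from collections import Counter

# ASSUMPTION: per-answer weights chosen for illustration only.
SCORES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}


def summarize(labels):
    """Return accuracy, hallucination rate, and truthfulness as fractions."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    unknown = set(counts) - set(SCORES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    n = len(labels)
    return {
        # Accuracy counts both perfect and acceptable answers.
        "accuracy": (counts["perfect"] + counts["acceptable"]) / n,
        # ASSUMPTION: every incorrect answer counts as a hallucination.
        "hallucination_rate": counts["incorrect"] / n,
        # Truthfulness is the mean per-answer score, so wrong answers cost points.
        "truthfulness": sum(SCORES[label] for label in labels) / n,
    }


# Hypothetical graded answers (toy data).
labels = ["perfect"] * 4 + ["acceptable"] * 2 + ["missing"] * 2 + ["incorrect"] * 2
result = summarize(labels)
print(result)
```

Under these assumptions, a system that abstains ("missing") instead of guessing loses accuracy but keeps its truthfulness score, which is why the two metrics can move in different directions.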
Key Results
| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| Baseline performance of advanced LLMs without RAG shows significant limitations. |
| CRAG | Accuracy | Not applicable | 34 | N/A |
| CRAG | Truthfulness | Not applicable | 20 | N/A |
| Impact of adding RAG to LLMs. |
| CRAG | Accuracy | 34 | 44 | +10 |
| CRAG | Truthfulness | 20 | 20 | 0 |
| Performance of State-of-the-Art Industry RAG solutions. |
| CRAG | Accuracy (Non-hallucinating) | 44 | 63 | +19 |
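The Δ column is the "This Paper" value minus the "Baseline" value. A minimal sketch that recomputes it for the rows with a numeric baseline; the function and variable names are illustrative, and the values are copied from the table.

```python
# (metric, baseline, this_paper) for table rows that have a numeric baseline.
rows = [
    ("Accuracy", 34, 44),
    ("Truthfulness", 20, 20),
    ("Accuracy (Non-hallucinating)", 44, 63),
]


def delta(baseline, value):
    """Difference between the reported value and its baseline."""
    return value - baseline


def format_delta(d):
    """Render a delta with an explicit sign, leaving zero unsigned."""
    return "0" if d == 0 else f"{d:+d}"


for metric, baseline, value in rows:
    print(f"{metric}: {format_delta(delta(baseline, value))}")
```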
Main Takeaways
- Naive RAG improves accuracy (34% to 44%) but can degrade truthfulness, because irrelevant retrieved content introduces hallucinations
- Questions about 'Head' entities are handled much better than 'Torso' or 'Tail' entities; GPT-4 truthfulness drops from 21% (Head) to 8% (Tail)
- Knowledge Graph access (Task 2) improves truthfulness over Web-only (Task 1) because structured data is more precise and less noisy
- Dynamic and real-time facts (Finance/Sports) are significantly harder than static facts, with much lower truthfulness scores
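Breakdowns such as Head versus Tail entities amount to computing truthfulness separately per question slice. A minimal sketch under stated assumptions: the record fields (`popularity`, `label`), the label names, and the weights (1, 0.5, 0, -1) are hypothetical, and the records are toy data rather than CRAG results.

```python
from collections import defaultdict

# ASSUMPTION: per-answer weights chosen for illustration only.
SCORES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}


def truthfulness_by_slice(records, key):
    """Mean per-answer score for each distinct value of records[i][key]."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record[key]] += SCORES[record["label"]]
        counts[record[key]] += 1
    return {name: totals[name] / counts[name] for name in totals}


def make_records(popularity, labels):
    """Build toy graded answers for one popularity slice."""
    return [{"popularity": popularity, "label": label} for label in labels]


# Hypothetical graded answers (toy data).
records = (
    make_records("head", ["perfect", "perfect", "incorrect", "missing"])
    + make_records("torso", ["perfect", "acceptable", "incorrect", "missing"])
    + make_records("tail", ["perfect", "incorrect", "missing", "missing"])
)

by_slice = truthfulness_by_slice(records, "popularity")
print(by_slice)
```

The same function works for other slice keys, for example a hypothetical `domain` or `dynamism` field, which is how a static versus real-time comparison could be produced.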