Evaluation Setup
Comparison against CoT and RAG-CoT baselines across multiple knowledge-intensive tasks
Benchmarks:
- HotpotQA (Complex Factual QA)
- Natural Questions (Complex Factual QA)
Metrics:
- Factual Accuracy
- Hallucination Rate
- Citation Quality (F1 score assessing precision, relevance, verifiability)
Statistical methodology: Not explicitly reported in the paper.
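The paper's exact scoring procedure is not given here, so the following is only a minimal sketch of how the three metrics could be computed. It assumes a hypothetical per-example record (`Example`) with binary correctness and hallucination judgments, and it assumes Citation Quality is a standard set-based F1 over cited versus supporting source IDs. The summary describes the F1 as assessing precision, relevance, and verifiability, which may differ from this simplification.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Example:
    """One evaluated answer. This record layout is hypothetical."""
    is_correct: bool         # answer judged factually correct
    has_hallucination: bool  # answer contains at least one unsupported claim
    cited: Set[str]          # source IDs the model cited
    relevant: Set[str]       # source IDs judged to support the answer


def factual_accuracy(examples: List[Example]) -> float:
    """Fraction of answers judged factually correct."""
    return sum(e.is_correct for e in examples) / len(examples)


def hallucination_rate(examples: List[Example]) -> float:
    """Fraction of answers containing at least one unsupported claim."""
    return sum(e.has_hallucination for e in examples) / len(examples)


def citation_f1(cited: Set[str], relevant: Set[str]) -> float:
    """Set-based F1 between cited and supporting sources (assumed definition)."""
    true_positives = len(cited & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cited)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, accuracy and hallucination rate are simple proportions over the evaluation set, while citation F1 is computed per answer and can then be averaged across examples.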
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Complex Factual QA (Aggregated) | Factual Accuracy | 72% | 83% | +11 pp |
| Complex Factual QA (Aggregated) | Hallucination Rate | 25% | 12% | -13 pp |
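The Δ column is an absolute difference between two percentages, that is, a change in percentage points. The short sketch below recomputes it from the reported values and also derives the relative change, which the summary does not report. The function names are illustrative only.

```python
def absolute_change_pp(baseline_pct: float, method_pct: float) -> float:
    """Difference in percentage points (what the Δ column reports)."""
    return method_pct - baseline_pct


def relative_change_pct(baseline_pct: float, method_pct: float) -> float:
    """Change expressed as a percentage of the baseline value."""
    return 100.0 * (method_pct - baseline_pct) / baseline_pct


# (baseline %, this paper %) as reported in the Key Results table
rows = {
    "Factual Accuracy": (72.0, 83.0),
    "Hallucination Rate": (25.0, 12.0),
}

for metric, (baseline, method) in rows.items():
    pp = absolute_change_pp(baseline, method)
    rel = relative_change_pct(baseline, method)
    print(f"{metric}: {pp:+.0f} pp, {rel:+.1f}% relative")

# Factual Accuracy: +11 pp, +15.3% relative
# Hallucination Rate: -13 pp, -52.0% relative
```

Read this way, the hallucination rate is roughly halved relative to the baseline, while factual accuracy improves by about 15% relative to the baseline.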
Main Takeaways
- Consistently outperforms traditional CoT and basic RAG-enhanced CoT across evaluated tasks (QA, Summarization, Explanatory Generation).
- Reduces the hallucination rate (from 25% to 12% on the aggregated Complex Factual QA results) without requiring external knowledge bases or architectural changes.
- Demonstrates that LLMs have an inherent capacity for 'self-correction' when prompted to explicitly simulate verification and citation processes.
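To make the last takeaway concrete, here is a hypothetical sketch of a prompt that asks a model to simulate verification and citation before answering. The step wording and the function name `build_self_verification_prompt` are assumptions for illustration. They are not the paper's actual prompt, which is not reproduced in this summary.

```python
def build_self_verification_prompt(question: str) -> str:
    """Build a prompt asking the model to reason, verify, and cite.

    The steps below are an illustrative assumption, not the paper's template.
    """
    steps = [
        "Answer the question with step-by-step reasoning.",
        "List each factual claim made in the reasoning.",
        "For each claim, name the source you would cite and state "
        "whether the claim can be verified.",
        "Revise or remove any claim you could not support, "
        "then give the final answer.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Question: {question}\n\nFollow these steps:\n{numbered}"


if __name__ == "__main__":
    print(build_self_verification_prompt("Who wrote Hamlet?"))
```

Because everything happens inside the prompt, a setup like this needs no retrieval component or model changes, which matches the claim that the approach works without external knowledge bases.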