Evaluation Setup
Question answering across multiple benchmarks, scored with ranking-based HIT@K metrics
Benchmarks:
- WebQuestionsSP (WebQSP) (Knowledge Base QA)
- CWQ (Complex Web Questions) (Complex QA)
- GSM8K (Math Word Problems)
- MWP (Math Word Problems)
- Dr. SPIDER (Text-to-SQL / Structural reasoning)
Metrics:
- HIT@1
- HIT@3
- HIT@5
- Statistical methodology: Not explicitly reported in the paper
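The HIT@K metrics listed above can be illustrated with a minimal sketch. It assumes each question has a ranked list of predicted answers and a set of gold answers, and that a hit means an exact string match within the top K predictions. The paper's actual matching rule is not stated here, and the names `hit_at_k` and `mean_hit_at_k` and the toy data are hypothetical.

```python
from typing import Sequence, Set


def hit_at_k(ranked_answers: Sequence[str], gold_answers: Set[str], k: int) -> bool:
    """Return True if any of the top-k predictions is a gold answer.

    Assumption: exact string match; the paper may normalize answers differently.
    """
    return any(answer in gold_answers for answer in ranked_answers[:k])


def mean_hit_at_k(
    predictions: Sequence[Sequence[str]],
    golds: Sequence[Set[str]],
    k: int,
) -> float:
    """Percentage of questions with at least one gold answer in the top k."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not predictions:
        raise ValueError("at least one question is required")
    hits = sum(hit_at_k(p, g, k) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(predictions)


# Toy data (illustrative only, not taken from the paper).
predictions = [
    ["Paris", "Lyon", "Nice"],       # gold answer ranked 1st
    ["42", "41", "40"],              # gold answer ranked 2nd
    ["Obama", "Bush", "Clinton"],    # gold answer ranked 3rd
    ["x", "y", "z"],                 # gold answer not retrieved
]
golds = [{"Paris"}, {"41"}, {"Clinton"}, {"w"}]

for k in (1, 3, 5):
    print(f"HIT@{k}: {mean_hit_at_k(predictions, golds, k):.2f}")
```

Because a hit in the top K is also a hit in any larger cutoff, HIT@1 ≤ HIT@3 ≤ HIT@5 holds for a fixed set of predictions under this definition.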
Key Results
The proposed method (KDCM) shows significant improvements in HIT@K metrics compared to baselines, indicating reduced hallucination.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Average across datasets | HIT@1 improvement | Not reported in the paper | Not reported in the paper | +15.64% |
| Average across datasets | HIT@3 improvement | Not reported in the paper | Not reported in the paper | +13.38% |
| Average across datasets | HIT@5 improvement | Not reported in the paper | Not reported in the paper | +13.28% |
| Several evaluation settings | HIT@1 / HIT@3 / HIT@5 | Not reported in the paper | >95.00 | Not reported in the paper |
Main Takeaways
- Code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations.
- The method demonstrates strong generalization across diverse datasets (QA, Math, SQL-related).
- Explicit regulation of intermediate reasoning steps effectively constrains erroneous reasoning trajectories.
- Robustness is maintained even when prompts are underspecified or ambiguous.