KDCM: Reducing Hallucination in LLMs through Explicit Reasoning Structures

📝 Paper Summary

Prompt-Induced Hallucination Mitigation
Chain-of-Thought Reasoning
Knowledge Distillation

KDCM mitigates prompt-induced hallucinations by embedding executable code into the reasoning prompt to guide knowledge exploration and enforce structured intermediate steps.

Core Problem

Large language models frequently suffer from prompt-induced hallucinations, where ambiguous or misleading prompts cause models to generate fluent but factually incorrect reasoning traces.

Why it matters:

Reliance on internal probabilistic prediction leads to error accumulation across multi-step reasoning tasks
Existing retrieval or verification methods often fail to explicitly constrain the model's internal reasoning process itself
Hallucinations prevent reliable deployment in safety-critical domains like scientific research and clinical decision support

Concrete Example: When a model is given an underspecified prompt about a complex entity relationship, it might hallucinate a plausible-sounding connection. KDCM instead generates code to traverse a knowledge graph, validating the relationship step-by-step before answering.
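As a rough illustration of the traversal step in this example, the sketch below searches a toy triple store for a path linking two entities and reports failure when none exists. The triples, the `build_index` and `find_path` helpers, and the `max_hops` limit are all assumptions made for illustration; the paper does not specify its traversal code.

```python
from collections import deque

# Toy knowledge graph as (head, relation, tail) triples. All entries are
# illustrative assumptions, not data from the paper.
TRIPLES = [
    ("Marie Curie", "spouse", "Pierre Curie"),
    ("Pierre Curie", "field", "Physics"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]


def build_index(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    index = {}
    for head, relation, tail in triples:
        index.setdefault(head, []).append((relation, tail))
    return index


def find_path(index, source, target, max_hops=3):
    """Breadth-first search. Returns the list of triples linking source to
    target, or None when no path exists within max_hops."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        entity, path = queue.popleft()
        if entity == target:
            return path
        if len(path) >= max_hops:
            continue
        for relation, neighbor in index.get(entity, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(entity, relation, neighbor)]))
    return None


index = build_index(TRIPLES)
print(find_path(index, "Marie Curie", "Poland"))  # two-hop path via Warsaw
print(find_path(index, "Marie Curie", "Berlin"))  # None: no supporting path
```

A `None` result is the useful signal here: rather than asserting a plausible-sounding link, a system built this way can decline to state a relationship that the graph does not support.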

Key Novelty

Code-Guided Knowledge Distillation Chain

Embeds a programmable module (executable code) within reasoning prompts to act as an explicit control signal for knowledge exploration
Reformulates input prompts into structured sub-problems and uses code to traverse external knowledge graphs, constraining intermediate reasoning steps
Integrates this structured guidance into a knowledge distillation framework to teach the model to self-correct and verify its own logic

Evaluation Highlights

+15.64% average improvement in HIT@1 across datasets compared to baselines using GPT-4 and LLaMA-3.3
Scores exceeding 95% on HIT@1, HIT@3, and HIT@5 across several evaluation settings
Performance remains robust even when prompts are deliberately underspecified or ambiguous

Breakthrough Assessment

7/10

Strong empirical results (>95% HIT scores) and a novel integration of executable code into the reasoning/distillation loop, though reliance on external structured knowledge may limit applicability in unstructured domains.

⚙️ Technical Details

Problem Definition

Setting: Question answering and reasoning under ambiguous or noisy prompts

Inputs: Natural language query/prompt q

Outputs: Predicted answer a, grounded in structured reasoning traces

Pipeline Flow

Prompt Reformulation (Decompose into structured sub-problems)
Code Generation & Execution (Generate code to explore external Knowledge Graph)
Distillation Chain Reasoning (Generate intermediate traces constrained by code results)
Answer Generation (Final output grounded in validated steps)
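The four stages above can be sketched as a simple function chain. This is a minimal skeleton under stated assumptions, not the paper's implementation: `run_pipeline`, the four `toy_*` stand-ins, the `FACTS` lookup table, and the rule of refusing to answer when evidence is missing are all hypothetical. In a real system the stand-ins would be replaced by LLM calls and actual knowledge graph queries.

```python
from typing import Callable, Dict, List


def run_pipeline(
    prompt: str,
    reformulate: Callable[[str], List[str]],
    explore: Callable[[str], List[str]],
    reason: Callable[[str, List[str]], str],
    answer: Callable[[str, List[str]], str],
) -> Dict[str, object]:
    """Chain the four stages in order. Every callable is a hypothetical stand-in."""
    sub_problems = reformulate(prompt)                          # 1. prompt reformulation
    evidence = {sp: explore(sp) for sp in sub_problems}         # 2. code-guided exploration
    traces = [reason(sp, evidence[sp]) for sp in sub_problems]  # 3. constrained reasoning
    final = answer(prompt, traces)                              # 4. answer generation
    return {"sub_problems": sub_problems, "evidence": evidence,
            "traces": traces, "answer": final}


# Toy stand-ins so the sketch runs without an LLM or a real knowledge graph.
FACTS = {
    "Where was Marie Curie born?": ["Marie Curie born_in Warsaw"],
    "Which country is Warsaw the capital of?": ["Warsaw capital_of Poland"],
}
NO_FACT = "no supporting fact found"


def toy_reformulate(prompt: str) -> List[str]:
    # Split a compound prompt into one sub-problem per question.
    return [part.strip() + "?" for part in prompt.split("?") if part.strip()]


def toy_explore(sub_problem: str) -> List[str]:
    # Look up supporting facts; an empty list means nothing was found.
    return FACTS.get(sub_problem, [])


def toy_reason(sub_problem: str, facts: List[str]) -> str:
    return f"{sub_problem} -> {'; '.join(facts) if facts else NO_FACT}"


def toy_answer(prompt: str, traces: List[str]) -> str:
    # Refuse to answer when any step lacks supporting evidence.
    if any(trace.endswith(NO_FACT) for trace in traces):
        return "insufficient evidence"
    return traces[-1].split(" -> ", 1)[1]


result = run_pipeline(
    "Where was Marie Curie born? Which country is Warsaw the capital of?",
    toy_reformulate, toy_explore, toy_reason, toy_answer,
)
print(result["answer"])
```

Passing the stages in as callables keeps the control flow visible and lets each stage be swapped independently, which is a design choice of this sketch rather than something the paper prescribes.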

System Modules

Prompt Reformulator

Reformulate input prompt into a set of structured sub-problems

Model or implementation: LLM (GPT-4 or LLaMA-3.3)

Code-Guided Explorer (Reasoning & Retrieval)

Generate and execute code to traverse knowledge graphs for sub-problems

Model or implementation: Programmable module embedded in prompt

Distillation Chain Generator (Reasoning & Retrieval)

Produce intermediate reasoning traces constrained by external knowledge

Model or implementation: LLM (Student)
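One plausible reading of "constrained by external knowledge" is that each intermediate step carries a claim that can be checked against facts returned by the explorer. The sketch below separates supported steps from unsupported ones. The step format (a `text` plus a `claim` triple), the `split_trace` helper, and the sample data are assumptions for illustration; the paper does not describe the exact mechanism.

```python
def split_trace(steps, kg_facts):
    """Separate reasoning steps whose claimed triple appears in kg_facts
    from steps whose claim has no support."""
    facts = set(kg_facts)
    supported = [step for step in steps if step["claim"] in facts]
    unsupported = [step for step in steps if step["claim"] not in facts]
    return supported, unsupported


# Illustrative facts and trace; none of this comes from the paper.
KG_FACTS = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]
TRACE = [
    {"text": "Marie Curie was born in Warsaw.",
     "claim": ("Marie Curie", "born_in", "Warsaw")},
    {"text": "Warsaw is the capital of Germany.",
     "claim": ("Warsaw", "capital_of", "Germany")},
]

supported, unsupported = split_trace(TRACE, KG_FACTS)
print([step["text"] for step in supported])
print([step["text"] for step in unsupported])
```

Under this reading, only the supported steps would be kept as material for the student model, while unsupported steps would be regenerated or dropped.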

Answer Generator

Generate final answer based on validated reasoning

Model or implementation: LLM

Novel Architectural Elements

Embedding executable code as a programmable module within the reasoning prompt itself to guide knowledge graph exploration
Integration of code-execution results directly into a chain-style knowledge distillation framework to regulate intermediate reasoning
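To make the idea of a code module inside the prompt more concrete, here is a hedged sketch that executes a small exploration function and places both the code and its output in the prompt text. The prompt layout, the `explore` module, and the `run_module` and `build_prompt` helpers are all hypothetical, since the paper does not publish its prompt templates.

```python
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
]

# Hypothetical exploration module; in a real system this text might be
# written by hand or generated by the model.
MODULE_SOURCE = '''
def explore(triples, entity):
    """Return every (relation, tail) pair whose head is `entity`."""
    return [(r, t) for h, r, t in triples if h == entity]
'''


def run_module(source, triples, entity):
    """Execute the module text and call its `explore` function."""
    namespace = {}
    # Assumption: the source is trusted. Model-generated code should run
    # in a sandbox rather than through a bare exec().
    exec(source, namespace)
    return namespace["explore"](triples, entity)


def build_prompt(question, source, result):
    """Place the question, the module, and its executed output in one prompt."""
    lines = [
        "Question: " + question,
        "Exploration module:",
        source.strip(),
        "Execution result:",
    ]
    lines += [f"- {relation}: {tail}" for relation, tail in result]
    lines.append("Reason step by step using only the execution result above.")
    return "\n".join(lines)


result = run_module(MODULE_SOURCE, TRIPLES, "Warsaw")
print(build_prompt("Which country is Warsaw the capital of?", MODULE_SOURCE, result))
```

The point of the sketch is the ordering: the code runs first, and the model only sees the question together with facts the code actually returned.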

Modeling

Base Model: GPT-4 and LLaMA-3.3

Training Method: Knowledge Distillation Chain framework (Distillation-based reasoning)

Adaptation: Enhanced distillation chain

Training Data:

WebQuestionsSP (WebQSP)
Complex Web Questions (CWQ)
GSM8K
MWP
Dr. SPIDER

Compute: Not reported in the paper

Comparison to Prior Work

vs. KG-LLM-PR: KDCM uses executable code to actively guide exploration rather than just prompting with KG data
vs. LLM-SubKG-Sum: KDCM integrates the structured knowledge into a distillation chain for step-by-step regulation rather than just summarizing sub-graphs
vs. Chain-of-Thought (CoT): KDCM imposes external structural constraints via code, whereas standard CoT relies on internal probabilistic generation [not cited in paper]

Limitations

Incorporating code-guided reasoning increases inference complexity and computational cost
Depends on access to high-quality structured external knowledge (Knowledge Graphs)
Performance in open-ended generation and multimodal settings requires further validation

Reproducibility

The paper does not state whether code is available. The datasets used are public (WebQuestionsSP, GSM8K, etc.). Prompt templates and implementation details for the programmable module are not described in full.

📊 Experiments & Results

Evaluation Setup

Question answering across multiple benchmarks using retrieval-based metrics

Benchmarks:

WebQuestionsSP (WebQSP) (Knowledge Base QA)
CWQ (Complex Web Questions) (Complex QA)
GSM8K (Math Word Problems)
MWP (Math Word Problems)
Dr. SPIDER (Text-to-SQL / Structural reasoning)

Metrics:

HIT@1
HIT@3
HIT@5

Statistical methodology: Not explicitly reported in the paper

Key Results

The proposed method (KDCM) shows significant improvements in HIT@K metrics compared to baselines, indicating reduced hallucination.

Benchmark	Metric	Baseline	This Paper	Δ
Average across datasets	HIT@1 improvement	Not reported in the paper	Not reported in the paper	+15.64%
Average across datasets	HIT@3 improvement	Not reported in the paper	Not reported in the paper	+13.38%
Average across datasets	HIT@5 improvement	Not reported in the paper	Not reported in the paper	+13.28%
Several evaluation settings	HIT@1 / HIT@3 / HIT@5	Not reported in the paper	>95.00	Not reported in the paper

Main Takeaways

Code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations.
The method demonstrates strong generalization across diverse datasets (QA, Math, SQL-related).
Explicit regulation of intermediate reasoning steps effectively constrains erroneous reasoning trajectories.
Robustness is maintained even when prompts are underspecified or ambiguous.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) reasoning
Knowledge Distillation
Knowledge Graphs (KG)
Large Language Models (LLMs)

Key Terms

HIT@K: A metric measuring the proportion of test queries where the correct answer appears in the top K predicted candidates
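The definition above translates directly into a few lines of code. In this sketch, the `hit_at_k` name, the representation of gold answers as sets of acceptable strings, and the sample predictions are assumptions for illustration; the paper's evaluation script is not available.

```python
def hit_at_k(ranked_predictions, gold_answers, k):
    """Fraction of queries whose top-k candidates contain an acceptable answer.

    ranked_predictions: one list of candidates per query, best first.
    gold_answers: one set of acceptable answers per query.
    """
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align")
    if not ranked_predictions:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_predictions, gold_answers)
        if any(candidate in gold for candidate in candidates[:k])
    )
    return hits / len(ranked_predictions)


# Illustrative data only.
predictions = [
    ["Poland", "Germany", "France"],   # correct at rank 1
    ["Paris", "Lyon", "Nice"],         # correct at rank 2
    ["7", "8", "9"],                   # correct answer never appears
]
gold = [{"Poland"}, {"Lyon"}, {"10"}]

for k in (1, 3):
    print(f"HIT@{k} = {hit_at_k(predictions, gold, k):.3f}")
```

Because the top-k window only grows with k, HIT@K can never decrease as K increases, which is why HIT@5 is always at least as high as HIT@1 on the same predictions.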

Knowledge Distillation: A process where a smaller or student model learns to replicate the behavior or knowledge of a larger teacher model
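For readers unfamiliar with distillation, the sketch below shows the classic soft-label objective: the KL divergence between temperature-softened teacher and student distributions. This is the generic textbook formulation, included for background only; the source does not confirm that KDCM's distillation chain uses this loss, and the logits and temperature value are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so that gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]   # illustrative logits
print(distillation_loss(teacher, [4.0, 1.0, 0.5]))  # identical student: zero loss
print(distillation_loss(teacher, [0.5, 1.0, 4.0]))  # mismatched student: positive loss
```

A higher temperature flattens both distributions, so the student is also rewarded for matching the teacher's relative preferences among wrong answers, not just its top choice.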

Chain-of-Thought: A prompting technique that encourages models to generate intermediate reasoning steps before the final answer

Prompt-Induced Hallucination: Errors where a model generates incorrect facts specifically triggered by misleading, ambiguous, or underspecified prompts

Knowledge Graph: A structured representation of knowledge using entities (nodes) and relations (edges)

Code-Guided Reasoning: Using executable code snippets within the inference process to structurally traverse data or verify logic, rather than relying solely on natural language generation