J. Wang, Qiushi Sun, Nuo Chen, Xiang Lorraine Li, Ming Gao
East China Normal University,
The University of Hong Kong
Annual Meeting of the Association for Computational Linguistics
(2023)
Factuality, Reasoning, KG
📝 Paper Summary
Hallucination suppression, Reasoning elicitation
Chain-of-Knowledge (CoK) prompting mitigates hallucinations by having the LLM generate structured knowledge triples as evidence. The evidence is then verified for factuality and faithfulness, and an unreliable result triggers a rethinking step.
Core Problem
Standard Chain-of-Thought (CoT) prompting often leads to hallucinations where generated rationales are unfactual or unfaithful, causing wrong answers despite seemingly logical steps.
Why it matters:
LLMs generate plausible but false reasoning steps (e.g., assigning a player the wrong profession), leading to incorrect conclusions.
Faithfulness gaps occur when reasoning chains are logically sound but do not actually support the final answer derived by the model.
Concrete Example: Query: 'Is the following sentence plausible: Derrick White backhanded a shot.' Standard CoT fails by hallucinating 'Derrick White is most likely a hockey player' (unfactual). CoK instead produces the evidence triple (Derrick White, is a, Basketball player), which corrects the answer to 'False'.
Key Novelty
Chain-of-Knowledge (CoK) Prompting with F2-Verification
Replaces vague textual rationales with 'Evidence Triples' (structured subject-relation-object data) combined with explanation hints, mimicking a human mind map.
Introduces F2-Verification to calculate scores for both Factuality (match with Knowledge Base) and Faithfulness (consistency between rationale and answer).
Implements a 'Rethinking' loop: if the reliability score is below a threshold, the system injects correct knowledge triples and prompts the LLM to generate the answer again.
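To make the evidence-triple format concrete, the sketch below parses triples out of generated text. It assumes triples are written as parenthesised `(subject, relation, object)` groups, as in the Derrick White example; the names `Triple`, `TRIPLE_RE`, and `parse_triples` are illustrative and do not come from the paper.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Assumed surface form: "(subject, relation, object)" with exactly three
# comma-separated fields and no nested parentheses.
TRIPLE_RE = re.compile(r"\(([^(),]+),([^(),]+),([^(),]+)\)")


def parse_triples(text: str) -> list[Triple]:
    """Extract every (subject, relation, object) group found in the text."""
    return [
        Triple(*(part.strip() for part in match.groups()))
        for match in TRIPLE_RE.finditer(text)
    ]


generated = "Evidence triples: 1. (Derrick White, is a, Basketball player)"
triples = parse_triples(generated)
print(triples)
```

Parsing into a typed structure is one way to hand the generated evidence to a downstream verifier; a real system would likely need to tolerate looser formatting than this regular expression does.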
Architecture
The overall framework of CoK, including Exemplar Construction and the Inference process with F2-Verification and Rethinking.
Evaluation Highlights
+9.4% improvement on CommonsenseQA compared to standard Chain-of-Thought (CoT) prompting.
Outperforms Auto-CoT by +6.1% on the StrategyQA benchmark.
Achieves higher performance than self-consistency methods on arithmetic reasoning tasks like GSM8K.
Breakthrough Assessment
7/10
Novel integration of structured knowledge triples into the prompting sequence itself, combined with a dynamic verification and rethinking loop. Strong empirical gains on reasoning tasks.
Outputs: Predicted answer A supported by a reasoning chain
Pipeline Flow
Exemplar Construction (Offline)
CoK Generation (Inference)
F2-Verification (Inference)
Rethinking (Inference Loop)
System Modules
CoK Prompt Construction
Concatenates labeled exemplars (containing manually annotated triples and explanations) with the test query.
Model or implementation: Prompt Template
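A minimal sketch of this concatenation step follows. The field labels ("Evidence triples", "Explanation hints"), the `Exemplar` class, and the Stephen Curry exemplar are assumptions made up for illustration; the paper's actual template and exemplars may differ.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    evidence_triples: list[tuple[str, str, str]]
    explanation_hint: str
    answer: str


def format_exemplar(ex: Exemplar) -> str:
    """Render one annotated exemplar as a text block (assumed layout)."""
    triples = " ".join(
        f"{i}. ({s}, {r}, {o})"
        for i, (s, r, o) in enumerate(ex.evidence_triples, 1)
    )
    return (
        f"Q: {ex.question}\n"
        f"Evidence triples: {triples}\n"
        f"Explanation hints: {ex.explanation_hint}\n"
        f"A: {ex.answer}"
    )


def build_cok_prompt(exemplars: list[Exemplar], query: str) -> str:
    """Concatenate the exemplars, then the test query with an open slot."""
    blocks = [format_exemplar(ex) for ex in exemplars]
    blocks.append(f"Q: {query}\nEvidence triples:")
    return "\n\n".join(blocks)


# Hypothetical exemplar, written for this sketch only.
demo = Exemplar(
    question="Is the following sentence plausible: Stephen Curry made a three-pointer.",
    evidence_triples=[("Stephen Curry", "is a", "basketball player")],
    explanation_hint="A three-pointer is a basketball shot.",
    answer="Yes",
)
prompt = build_cok_prompt(
    [demo], "Is the following sentence plausible: Derrick White backhanded a shot."
)
print(prompt)
```

Ending the prompt on the open "Evidence triples:" slot nudges the model to emit triples before the explanation and answer, mirroring the exemplar layout.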
Reasoning Generator
Generates evidence triples (CoK-ET), explanation hints (CoK-EH), and a candidate answer.
Model or implementation: LLM (e.g., text-davinci-003)
Factuality Verifier (Verification)
Checks generated triples against a Knowledge Base using exact matching or TransR embedding scores.
Model or implementation: Exact Match / TransR energy function
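The two checks named above can be sketched as follows. The TransR energy is the standard form, the squared norm of M_r h + r - M_r t, where lower means more plausible. Mapping that energy to a score with 1 / (1 + energy), returning 0.0 for unknown entities, and the toy embeddings are all assumptions of this sketch, not details taken from the paper.

```python
Vector = list[float]
Matrix = list[Vector]
Triple = tuple[str, str, str]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transr_energy(h: Vector, r: Vector, t: Vector, m_r: Matrix) -> float:
    """TransR energy ||M_r h + r - M_r t||^2; lower means more plausible."""
    h_r, t_r = matvec(m_r, h), matvec(m_r, t)
    return sum((a + b - c) ** 2 for a, b, c in zip(h_r, r, t_r))


def factuality_score(triple: Triple, kb: set, ent: dict, rel: dict, mats: dict) -> float:
    """Exact KB match scores 1.0; otherwise fall back to a TransR-based score."""
    if triple in kb:
        return 1.0
    s, r, o = triple
    if s not in ent or o not in ent or r not in rel:
        return 0.0  # assumption: unknown symbols count as unsupported
    energy = transr_energy(ent[s], rel[r], ent[o], mats[r])
    return 1.0 / (1.0 + energy)  # assumed mapping from energy to (0, 1]


# Toy knowledge base and hand-picked 2-d embeddings, for illustration only.
kb = {("Derrick White", "is a", "basketball player")}
ent = {
    "Derrick White": [1.0, 0.0],
    "basketball player": [1.0, 1.0],
    "hockey player": [0.0, -1.0],
}
rel = {"is a": [0.0, 1.0]}
mats = {"is a": [[1.0, 0.0], [0.0, 1.0]]}

good = factuality_score(("Derrick White", "is a", "basketball player"), kb, ent, rel, mats)
bad = factuality_score(("Derrick White", "is a", "hockey player"), kb, ent, rel, mats)
print(good, bad)
```

With these toy vectors the hallucinated triple gets a clearly lower score than the exact match, which is the behaviour the verifier relies on.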
Faithfulness Verifier (Verification)
Measures semantic similarity between the explanation/evidence and the final answer.
Model or implementation: SimCSE
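The similarity step can be sketched as cosine similarity between two sentence embeddings. To stay self-contained, this sketch swaps SimCSE for a toy bag-of-words encoder (`bow_embed`), which is a stand-in and not what the paper uses; `faithfulness_score` and its `embed` parameter are hypothetical names.

```python
import math
from collections import Counter
from typing import Callable

SparseVec = dict[str, float]


def bow_embed(text: str) -> SparseVec:
    """Toy stand-in encoder: lowercase word counts. NOT SimCSE."""
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(u: SparseVec, v: SparseVec) -> float:
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def faithfulness_score(
    rationale: str, answer: str, embed: Callable[[str], SparseVec] = bow_embed
) -> float:
    """Similarity between the rationale and the answer sentence."""
    return cosine(embed(rationale), embed(answer))


same = faithfulness_score("derrick white plays basketball", "derrick white plays basketball")
unrelated = faithfulness_score("derrick white plays basketball", "no")
print(same, unrelated)
```

In practice the `embed` argument would be replaced by a real sentence encoder; the surrounding scoring logic stays the same.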
Rethinking Controller
Compares the aggregated reliability score against a threshold θ. If the score falls below θ, the controller injects correct triples from the KB into the prompt and loops back to generation.
Model or implementation: Algorithm 1
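The control flow of this loop can be sketched as below. This is not a transcription of Algorithm 1: the callables `generate`, `reliability`, and `correct_triples`, the "Known facts:" injection format, and the `max_rounds` cap are assumptions introduced so the example runs on its own with stubbed components.

```python
from typing import Callable

Triple = tuple[str, str, str]


def cok_with_rethinking(
    prompt: str,
    generate: Callable[[str], tuple[list[Triple], str]],
    reliability: Callable[[list[Triple], str], float],
    correct_triples: Callable[[list[Triple]], list[Triple]],
    theta: float,
    max_rounds: int = 3,  # assumed safety cap on rethinking rounds
) -> str:
    triples, answer = generate(prompt)
    for _ in range(max_rounds):
        if reliability(triples, answer) >= theta:
            break  # evidence judged reliable, keep this answer
        fixes = correct_triples(triples)
        facts = " ".join(f"({s}, {r}, {o})" for s, r, o in fixes)
        prompt = f"{prompt}\nKnown facts: {facts}"  # inject KB knowledge
        triples, answer = generate(prompt)
    return answer


# Stubs standing in for the LLM, the F2 score, and the KB lookup.
KB = {("Derrick White", "is a", "basketball player")}


def fake_llm(prompt: str) -> tuple[list[Triple], str]:
    if "basketball player" in prompt:
        return [("Derrick White", "is a", "basketball player")], "no"
    return [("Derrick White", "is a", "hockey player")], "yes"


def fake_reliability(triples: list[Triple], answer: str) -> float:
    return 1.0 if all(t in KB for t in triples) else 0.0


def fake_lookup(triples: list[Triple]) -> list[Triple]:
    keys = {(s, r) for s, r, _ in triples}
    return [t for t in KB if (t[0], t[1]) in keys]


result = cok_with_rethinking(
    "Is the following sentence plausible: Derrick White backhanded a shot.",
    fake_llm, fake_reliability, fake_lookup, theta=0.5,
)
print(result)
```

With the stubs above, the first pass hallucinates a hockey player, fails verification, and the second pass answers after the correct triple has been injected.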
Novel Architectural Elements
Integration of structured 'Evidence Triples' directly into the prompting output format.
Post-hoc F2-Verification loop that triggers a specific 'Rethinking' step with knowledge injection.
Modeling
Base Model: text-davinci-003 (InstructGPT)
Comparison to Prior Work
vs. CoT: CoK enforces structured output (triples) rather than free-text only, reducing ambiguity.
vs. Self-Consistency: CoK uses explicit knowledge verification (F2) rather than just statistical consensus.
vs. ReAct [not cited in paper]: CoK generates evidence as part of the reasoning chain instead of iteratively calling external tools as actions; it is a prompting strategy with a verification loop.
Limitations
Relies on the availability and coverage of an external Knowledge Base for verification.
Manual annotation is required for the initial CoK exemplars (evidence triples).
Inference cost is higher due to the verification and potential rethinking loops.
CoK-ET: Evidence Triples. Structured knowledge triples (subject, relation, object) generated by the LLM to support reasoning.
CoK-EH: Explanation Hints. Textual explanations accompanying the evidence triples.
F2-Verification: A mechanism to estimate reliability based on Factuality (matching evidence to a KB) and Faithfulness (consistency between evidence and answer).
SimCSE: A sentence embedding model used here to measure semantic similarity between the reasoning chain and the final answer for faithfulness verification.
TransR: A knowledge graph embedding method used to score the validity of triples not found exactly in the knowledge base.
Hallucination: Generated content that is nonsensical or unfaithful to the provided source content/facts.
Self-consistency: A decoding strategy that samples multiple reasoning paths and selects the most consistent answer.