Boosting Language Models Reasoning with Chain-of-Knowledge Prompting

📝 Paper Summary

Hallucination suppression, Reasoning elicitation

CoK prompting mitigates hallucinations by having LLMs generate structured knowledge triples as evidence. The evidence is then verified for factuality and faithfulness, and an unreliable result triggers a rethinking step.

Core Problem

Standard Chain-of-Thought (CoT) prompting often leads to hallucinations where generated rationales are unfactual or unfaithful, causing wrong answers despite seemingly logical steps.

Why it matters:

LLMs generate plausible but fake reasoning steps (e.g., wrong player profession), leading to incorrect conclusions.
Faithfulness gaps occur when reasoning chains are logically sound but do not actually support the final answer derived by the model.

Concrete Example: Query: 'Is the following sentence plausible: Derrick White backhanded a shot.' Standard CoT fails by hallucinating 'Derrick White is most likely a hockey player' (unfactual). CoK retrieves the triple (Derrick White, is a, Basketball player), so the reasoning correctly concludes 'False'.

Key Novelty

Chain-of-Knowledge (CoK) Prompting with F2-Verification

Replaces vague textual rationales with 'Evidence Triples' (structured subject-relation-object data) combined with explanation hints, mimicking a human mind map.
Introduces F2-Verification to calculate scores for both Factuality (match with Knowledge Base) and Faithfulness (consistency between rationale and answer).
Implements a 'Rethinking' loop: if the reliability score is below a threshold, the system injects correct knowledge triples and prompts the LLM to generate the answer again.

Architecture

The overall framework of CoK, including Exemplar Construction and the Inference process with F2-Verification and Rethinking.

Evaluation Highlights

+9.4 accuracy points on CommonsenseQA (70.3 to 79.7) compared to standard Chain-of-Thought (CoT) prompting.
Outperforms Auto-CoT by +6.1 accuracy points on the StrategyQA benchmark (66.4 to 72.5).
Achieves higher performance than self-consistency methods on arithmetic reasoning tasks like GSM8K.

Breakthrough Assessment

7/10

Novel integration of structured knowledge triples into the prompting sequence itself, combined with a dynamic verification and rethinking loop. Strong empirical gains on reasoning tasks.

⚙️ Technical Details

Problem Definition

Setting: Few-shot in-context learning for complex reasoning tasks (commonsense, factual, symbolic, arithmetic).

Inputs: Natural language query Q

Outputs: Predicted answer A supported by a reasoning chain

Pipeline Flow

Exemplar Construction (Offline)
CoK Generation (Inference)
F2-Verification (Inference)
Rethinking (Inference Loop)

System Modules

CoK Prompt Construction

Concatenates labeled exemplars (containing manually annotated triples and explanations) with the test query.

Model or implementation: Prompt Template
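
The concatenation step can be illustrated with a minimal sketch. The field labels ("Evidence triples:", "Explanation hints:"), the `Exemplar` class, and the function names are assumptions made for illustration; the paper's actual template wording may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class Exemplar:
    """One manually annotated few-shot example (hypothetical container)."""
    question: str
    triples: List[Triple]
    explanation: str
    answer: str


def format_exemplar(ex: Exemplar) -> str:
    # Render triples as "(s, r, o)" groups followed by the explanation hint.
    triple_text = " ".join(f"({s}, {r}, {o})" for s, r, o in ex.triples)
    return (
        f"Q: {ex.question}\n"
        f"Evidence triples: {triple_text}\n"
        f"Explanation hints: {ex.explanation}\n"
        f"A: {ex.answer}"
    )


def build_cok_prompt(exemplars: List[Exemplar], query: str) -> str:
    # Labeled exemplars come first; the test query is appended with an open
    # "Evidence triples:" slot so the model continues in the same format.
    shots = "\n\n".join(format_exemplar(ex) for ex in exemplars)
    return f"{shots}\n\nQ: {query}\nEvidence triples:"
```

Ending the prompt on the open evidence slot nudges the model to emit triples before any free-text explanation or answer.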

Reasoning Generator

Generates evidence triples (CoK-ET), explanation hints (CoK-EH), and a candidate answer.

Model or implementation: LLM (e.g., text-davinci-003)
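
Before the generated evidence can be verified, the triples have to be recovered from the model's text output. The sketch below assumes the model writes each triple as a parenthesized, comma-separated group such as `(Derrick White, is a, Basketball player)`; that surface format and the name `parse_triples` are illustrative assumptions, not the paper's parser.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Matches "(subject, relation, object)" where no field contains
# parentheses or commas (a simplifying assumption).
TRIPLE_PATTERN = re.compile(r"\(([^(),]+),([^(),]+),([^(),]+)\)")


def parse_triples(generated: str) -> List[Triple]:
    """Extract evidence triples from generated text, trimming whitespace."""
    return [
        (s.strip(), r.strip(), o.strip())
        for s, r, o in TRIPLE_PATTERN.findall(generated)
    ]
```

Entities that themselves contain commas would need a more robust output format, for example one triple per line with a distinct delimiter.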

Factuality Verifier (Verification)

Checks generated triples against a Knowledge Base using exact matching or TransR embedding scores.

Model or implementation: Exact Match / TransR energy function
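
A rough sketch of this two-stage check follows. The TransR energy is the standard form, the squared norm of (M_r h + r - M_r t); the mapping from energy to a score in (0, 1] via 1 / (1 + energy), the zero score for out-of-vocabulary items, and all function and parameter names are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Vector = List[float]
Matrix = List[List[float]]


def project(matrix: Matrix, vec: Vector) -> Vector:
    """Matrix-vector product: maps an entity vector into relation space."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transr_energy(h: Vector, r: Vector, t: Vector, m_r: Matrix) -> float:
    """TransR energy ||M_r h + r - M_r t||^2; lower means more plausible."""
    h_r, t_r = project(m_r, h), project(m_r, t)
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h_r, r, t_r))


def factuality_score(
    triple: Triple,
    kb: Set[Triple],
    entity_emb: Dict[str, Vector],
    relation_emb: Dict[str, Vector],
    relation_proj: Dict[str, Matrix],
) -> float:
    # Stage 1: an exact hit in the knowledge base counts as fully factual.
    if triple in kb:
        return 1.0
    s, r, o = triple
    # Assumption: items without embeddings cannot be scored, so return 0.
    if s not in entity_emb or o not in entity_emb or r not in relation_emb:
        return 0.0
    # Stage 2: fall back to the embedding energy, squashed into (0, 1]
    # (the squashing function is an assumption of this sketch).
    energy = transr_energy(
        entity_emb[s], relation_emb[r], entity_emb[o], relation_proj[r]
    )
    return 1.0 / (1.0 + energy)
```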

Faithfulness Verifier (Verification)

Measures semantic similarity between the explanation/evidence and the final answer.

Model or implementation: SimCSE
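
The faithfulness check reduces to a cosine similarity between two sentence embeddings. The sketch below keeps the encoder pluggable and, so that it runs without model weights, substitutes a toy bag-of-words encoder for SimCSE. That substitution and every name in the block are assumptions for illustration; a real setup would pass a SimCSE-backed `embed` function instead.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

Embedding = Dict[str, float]


def toy_embed(text: str) -> Embedding:
    """Stand-in encoder: lowercase bag-of-words counts. This is NOT SimCSE."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def faithfulness_score(
    rationale: str,
    answer_sentence: str,
    embed: Callable[[str], Embedding] = toy_embed,
) -> float:
    # Higher similarity suggests the rationale actually supports the answer.
    return cosine(embed(rationale), embed(answer_sentence))
```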

Rethinking Controller

Compares aggregated reliability score against threshold theta. If low, injects correct triples from KB and loops back to generation.

Model or implementation: Algorithm 1
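
The control flow of the loop can be sketched as below. This is not a transcription of Algorithm 1: the weighted-sum aggregation with weight `alpha`, the `max_rounds` cap, and the callable interfaces (`generate`, `score_factuality`, `score_faithfulness`, `lookup_correct_triples`) are all assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Generation:
    """One model output: evidence triples, explanation hint, answer."""
    triples: List[Triple]
    explanation: str
    answer: str


def reliability(factuality: float, faithfulness: float, alpha: float = 0.5) -> float:
    # Assumed aggregation: convex combination of the two F2 scores.
    return alpha * factuality + (1.0 - alpha) * faithfulness


def answer_with_rethinking(
    query: str,
    generate: Callable[[str, List[Triple]], Generation],
    score_factuality: Callable[[Generation], float],
    score_faithfulness: Callable[[Generation], float],
    lookup_correct_triples: Callable[[Generation], List[Triple]],
    theta: float = 0.5,
    max_rounds: int = 3,
) -> Generation:
    injected: List[Triple] = []
    gen = generate(query, injected)
    # At most max_rounds regenerations; the last output is returned even if
    # it never clears the threshold.
    for _ in range(max_rounds):
        score = reliability(score_factuality(gen), score_faithfulness(gen))
        if score >= theta:
            break
        # Unreliable: inject corrected KB triples and ask the model again.
        injected = injected + lookup_correct_triples(gen)
        gen = generate(query, injected)
    return gen
```

Capping the number of rounds bounds the extra inference cost when the knowledge base cannot repair the evidence.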

Novel Architectural Elements

Integration of structured 'Evidence Triples' directly into the prompting output format.
Post-hoc F2-Verification loop that triggers a specific 'Rethinking' step with knowledge injection.

Modeling

Base Model: text-davinci-003 (InstructGPT)

Comparison to Prior Work

vs. CoT: CoK enforces structured output (triples) rather than free-text only, reducing ambiguity.
vs. Self-Consistency: CoK uses explicit knowledge verification (F2) rather than just statistical consensus.
vs. ReAct [not cited in paper]: CoK generates evidence as part of the chain rather than calling external tools iteratively as actions; CoK is a prompting strategy with a verification loop.

Limitations

Relies on the availability and coverage of an external Knowledge Base for verification.
Manual annotation is required for the initial CoK exemplars (evidence triples).
Inference cost is higher due to the verification and potential rethinking loops.

Reproducibility

Code: https://github.com/wjn1996/Chain-of-Knowledge

📊 Experiments & Results

Evaluation Setup

Few-shot learning on reasoning benchmarks.

Benchmarks:

CommonsenseQA (Commonsense Reasoning)
StrategyQA (Factual Reasoning)
GSM8K (Arithmetic Reasoning)
Last Letter Concatenation (Symbolic Reasoning)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

CoK consistently outperforms standard prompting methods across various reasoning domains.

Benchmark	Metric	Baseline	This Paper	Δ
CommonsenseQA	Accuracy	70.3	79.7	+9.4
StrategyQA	Accuracy	66.4	72.5	+6.1
GSM8K	Accuracy	46.9	49.6	+2.7

Main Takeaways

CoK significantly improves performance on commonsense and factual reasoning tasks where external knowledge is crucial.
The rethinking step triggered by F2-Verification provides substantial gains over static CoK prompting, confirming the value of the verification loop.
CoK can be combined with self-consistency for further improvements, acting as a plug-and-play module.

📚 Prerequisite Knowledge

Prerequisites

In-Context Learning (ICL)
Chain-of-Thought (CoT) prompting
Knowledge Graph triples (Subject, Relation, Object)

Key Terms

CoK-ET: Evidence Triples—structured knowledge triples (subject, relation, object) generated by the LLM to support reasoning.

CoK-EH: Explanation Hints—textual explanations accompanying the evidence triples.

F2-Verification: A mechanism to estimate reliability based on Factuality (matching evidence to a KB) and Faithfulness (consistency between evidence and answer).

SimCSE: A sentence embedding model used here to measure semantic similarity between the reasoning chain and the final answer for faithfulness verification.

TransR: A knowledge graph embedding method used to score the validity of triples not found exactly in the knowledge base.

Hallucination: Generated content that is nonsensical or unfaithful to the provided source content/facts.

Self-consistency: A decoding strategy that samples multiple reasoning paths and selects the most consistent answer.
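
For readers unfamiliar with self-consistency, the selection step amounts to a majority vote over the final answers of several sampled reasoning paths. The sketch below assumes the answers have already been sampled and normalized to strings; the function name is illustrative.

```python
from collections import Counter
from typing import List


def majority_answer(sampled_answers: List[str]) -> str:
    """Return the most frequent final answer among sampled reasoning paths."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) yields the (answer, count) pair with the highest count;
    # ties go to the answer that was seen first.
    return Counter(sampled_answers).most_common(1)[0][0]
```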