
Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs

MV Nguyen, L Luo, F Shiri, D Phung, YF Li, TT Vu…
Department of Data Science & AI, Monash University, VinAI Research, Vietnam
arXiv preprint, February 2024
Reasoning KG QA Factuality

📝 Paper Summary

This paper evaluates whether Large Language Models (LLMs) reason faithfully by grounding their Chain-of-Thought steps in Knowledge Graphs to check for factual correctness and logical coherence.
Core Problem
Existing evaluations focus solely on final answer accuracy, ignoring whether the intermediate Chain-of-Thought (CoT) reasoning is actually correct or merely a hallucinated path to the right answer.
Why it matters:
  • High answer accuracy can mask poor reasoning logic or lucky guesses, making LLMs unreliable for critical tasks requiring explainability
  • Users cannot trust an LLM's explanation if the model frequently hallucinates facts or logical connections, even when the final prediction is correct
Concrete Example: When answering 'Who is the brother of Justin Bieber?', an LLM might correctly answer 'Jaxon Bieber' but provide a hallucinated reasoning path involving incorrect relationships (e.g., falsely claiming Justin is the father of Jaxon) rather than the correct path via their shared parent.
Key Novelty
Reasoning Path Verification via Knowledge Graph Grounding
  • Discriminative Evaluation: Tests if LLMs can distinguish between valid reasoning paths and invalid ones (containing factual errors, incoherence, or irrelevance) when explicitly presented with options
  • Generative Evaluation: Parses unstructured LLM-generated CoT into structured paths, maps them to Knowledge Graph triples, and verifies if they form a valid chain connecting the question to the answer
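The generative check above can be sketched as a simple triple-matching routine. This is a minimal illustration, not the paper's pipeline: the entity and relation names are taken from the Bieber example, and the real method must first parse free-form CoT text into triples, a step omitted here.

```python
# Toy KG as a set of (head, relation, tail) triples (illustrative facts only).
KG = {
    ("Justin Bieber", "father", "Jeremy Bieber"),
    ("Justin Bieber", "mother", "Pattie Mallette"),
    ("Jeremy Bieber", "child", "Jaxon Bieber"),
}

def verify_path(path, kg, start, answer):
    """A parsed CoT path is valid iff it starts at the question entity,
    ends at the answer, chains head-to-tail, and every hop is a KG triple."""
    if not path or path[0][0] != start or path[-1][2] != answer:
        return False
    for i in range(len(path) - 1):
        if path[i][2] != path[i + 1][0]:  # each hop must continue from the last tail
            return False
    return all(triple in kg for triple in path)

# A faithful two-hop chain via the shared parent...
good = [("Justin Bieber", "father", "Jeremy Bieber"),
        ("Jeremy Bieber", "child", "Jaxon Bieber")]
# ...versus a hallucinated relation that still reaches the right answer.
bad = [("Justin Bieber", "father", "Jaxon Bieber")]

print(verify_path(good, KG, "Justin Bieber", "Jaxon Bieber"))  # True
print(verify_path(bad, KG, "Justin Bieber", "Jaxon Bieber"))   # False
```

The point of the sketch is that the `bad` path is rejected even though its endpoint matches the gold answer, which is exactly the answer-accuracy-versus-reasoning-accuracy distinction the paper measures.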
Evaluation Highlights
  • Large disparity between answer accuracy and reasoning faithfulness: e.g., on 2WikiMultihopQA, Llama-2-70b-chat achieves 43.15% answer accuracy but only 19.50% reasoning accuracy.
  • Bigger models widen the gap: As model size increases (e.g., Llama-2-7b to 70b), answer accuracy improves significantly (+14.4%), but reasoning accuracy improves much less (+6.7%), suggesting reliance on memorized answers rather than reasoning.
  • Discriminative tests show LLMs hold enough knowledge to recognize factual errors when presented with candidate paths (AUC above 90% for error detection), yet they fail to generate coherent reasoning chains on their own.
Breakthrough Assessment
7/10
Provides a crucial reality check for CoT reasoning by quantifying the 'reasoning-answer gap.' The methodology of grounding free-form CoT in KGs is a strong contribution to interpretability.