
Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs

MV Nguyen, L Luo, F Shiri, D Phung, YF Li, TT Vu…
Department of Data Science & AI, Monash University, VinAI Research, Vietnam
arXiv preprint, February 2024
Reasoning KG QA Factuality

📝 Paper Summary

This paper evaluates whether Large Language Models (LLMs) reason faithfully by grounding their Chain-of-Thought steps in Knowledge Graphs to check for factual correctness and logical coherence.
Core Problem
Existing evaluations focus solely on final answer accuracy, ignoring whether the intermediate Chain-of-Thought (CoT) reasoning is actually correct or merely a hallucinated path to the right answer.
Why it matters:
  • High answer accuracy can mask poor reasoning logic or lucky guesses, making LLMs unreliable for critical tasks requiring explainability
  • Users cannot trust an LLM's explanation if the model frequently hallucinates facts or logical connections, even when the final prediction is correct
Concrete Example: When answering 'Who is the brother of Justin Bieber?', an LLM might correctly answer 'Jaxon Bieber' but provide a hallucinated reasoning path involving incorrect relationships (e.g., falsely claiming Justin is the father of Jaxon) rather than the correct path via their shared parent.
Key Novelty
Reasoning Path Verification via Knowledge Graph Grounding
  • Discriminative Evaluation: Tests if LLMs can distinguish between valid reasoning paths and invalid ones (containing factual errors, incoherence, or irrelevance) when explicitly presented with options
  • Generative Evaluation: Parses unstructured LLM-generated CoT into structured paths, maps them to Knowledge Graph triples, and verifies if they form a valid chain connecting the question to the answer
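The generative check above can be sketched as a simple triple-matching routine. This is a minimal illustration, not the paper's pipeline: the entity and relation names are taken from the Bieber example, and the real method must first parse free-form CoT text into triples, a step omitted here.

```python
# Toy KG as a set of (head, relation, tail) triples (illustrative facts only).
KG = {
    ("Justin Bieber", "father", "Jeremy Bieber"),
    ("Justin Bieber", "mother", "Pattie Mallette"),
    ("Jeremy Bieber", "child", "Jaxon Bieber"),
}

def verify_path(path, kg, start, answer):
    """A parsed CoT path is valid iff it starts at the question entity,
    ends at the answer, chains head-to-tail, and every hop is a KG triple."""
    if not path or path[0][0] != start or path[-1][2] != answer:
        return False
    for i in range(len(path) - 1):
        if path[i][2] != path[i + 1][0]:  # each hop must continue from the last tail
            return False
    return all(triple in kg for triple in path)

# A faithful two-hop chain via the shared parent...
good = [("Justin Bieber", "father", "Jeremy Bieber"),
        ("Jeremy Bieber", "child", "Jaxon Bieber")]
# ...versus a hallucinated relation that still reaches the right answer.
bad = [("Justin Bieber", "father", "Jaxon Bieber")]

print(verify_path(good, KG, "Justin Bieber", "Jaxon Bieber"))  # True
print(verify_path(bad, KG, "Justin Bieber", "Jaxon Bieber"))   # False
```

The point of the sketch is that the `bad` path is rejected even though its endpoint matches the gold answer, which is exactly the answer-accuracy-versus-reasoning-accuracy distinction the paper measures.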
Evaluation Highlights
  • Large disparity between answer accuracy and reasoning faithfulness: e.g., on 2WikiMultihopQA, Llama-2-70b-chat achieves 43.15% answer accuracy but only 19.50% reasoning accuracy.
  • Bigger models widen the gap: As model size increases (e.g., Llama-2-7b to 70b), answer accuracy improves significantly (+14.4%), but reasoning accuracy improves much less (+6.7%), suggesting reliance on memorized answers rather than reasoning.
  • Discriminative tests show LLMs hold enough knowledge to recognize factual errors when presented with candidate paths (AUC above 90% for error detection), yet they fail to generate coherent reasoning chains on their own.
Breakthrough Assessment
7/10
Provides a crucial reality check for CoT reasoning by quantifying the 'reasoning-answer gap.' The methodology of grounding free-form CoT in KGs is a strong contribution to interpretability.