Path-of-Thoughts: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models

📝 Paper Summary

Relational Reasoning, Neuro-symbolic AI, LLM Reasoning

Path-of-Thoughts improves relational reasoning by using a single LLM call to extract a graph, identifying multiple reasoning paths between queried entities, and aggregating results to mitigate hallucinations and ambiguity.

Core Problem

LLMs struggle with multi-hop relational reasoning (e.g., kinship, spatial) due to shallow reasoning and hallucinations, while existing neuro-symbolic methods are brittle to extraction errors and require complex, task-specific translation.

Why it matters:

Multi-hop reasoning is essential for planning, navigation, and logic tasks, where LLMs typically underperform symbolic solvers.
Current neuro-symbolic approaches often require many LLM calls or highly specialized symbolic modules that break when the LLM makes minor extraction errors.
Pure prompting methods such as Chain-of-Thought (CoT) are often distracted by irrelevant context in long stories.

Concrete Example: In a story where 'A is west of B' and 'C is north of A', an LLM asked how C relates to B might hallucinate a direct relation or get confused by irrelevant details. Current symbolic methods might extract a wrong fact and fail completely. PoT extracts a graph and finds the paths that connect the queried entities (e.g., C->A->B); when several paths exist, they are cross-checked against each other, mitigating single-point failures.

Key Novelty

Path-of-Thoughts (PoT)

Decomposes reasoning into three stages: graph extraction, path identification, and reasoning, using a single LLM call for extraction.
Mitigates LLM errors by finding *multiple* independent reasoning paths between entities in the extracted graph, rather than relying on a single chain.
Uses the graph structure to filter out irrelevant context, passing only relevant reasoning chains to the final solver (LLM or symbolic).

Architecture

The 3-stage pipeline of Path-of-Thoughts: (1) Graph Extraction from story, (2) Path Identification between query nodes, (3) Reasoning to produce the answer.

Evaluation Highlights

Surpasses state-of-the-art baselines by up to 21.3 accuracy points on benchmark datasets such as CLUTRR and StepGame.
Achieves higher accuracy than Chain-of-Thought (CoT) and CoT-SC on complex Chinese kinship tasks involving over 500 relation types.
Demonstrates greater robustness to LLM extraction errors: multi-path validation lets it reason correctly even when the extracted graph contains noise.

Breakthrough Assessment

7/10

Strong empirical results (up to +21.3 accuracy points) and a practical way to check LLM hallucinations via graph path consistency. The single-call extraction is efficient, though the core novelty is an evolutionary step in neuro-symbolic reasoning rather than a paradigm shift.

⚙️ Technical Details

Problem Definition

Setting: Relational reasoning in which a story S (context + question) is given, and the goal is to infer the relation a, drawn from a pre-defined relation set R, that holds between two queried entities.

Inputs: Textual story S containing entities, relations, and a query.

Outputs: Target relation a (or set of relations) between the queried entities.

Pipeline Flow

Graph Extraction (LLM converts text to graph)
Path Identification (Algorithm finds paths between query nodes)
Reasoning (LLM or Symbolic Solver infers answer from paths)
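
The flow above can be sketched as a function that chains three interchangeable stages. This is only an illustrative skeleton: the name `path_of_thoughts`, the stage signatures, and the toy stand-in stages are assumptions made for this example, not the authors' code.

```python
from typing import Callable, List, Tuple

Edge = Tuple[str, str, str]   # (head entity, relation, tail entity)
Path = List[Edge]

def path_of_thoughts(
    story: str,
    extract: Callable[[str], Tuple[List[Edge], Tuple[str, str]]],
    find_paths: Callable[[List[Edge], str, str], List[Path]],
    reason: Callable[[List[Path]], str],
) -> str:
    """Run the three stages in order and return the predicted relation."""
    edges, (source, target) = extract(story)     # stage 1: graph extraction
    paths = find_paths(edges, source, target)    # stage 2: path identification
    return reason(paths)                         # stage 3: reasoning over paths

# Toy stand-ins (hypothetical) so the skeleton can be run end to end.
def toy_extract(story: str):
    return [("A", "west_of", "B")], ("A", "B")

def toy_find_paths(edges, source, target):
    return [[e] for e in edges if e[0] == source and e[2] == target]

def toy_reason(paths):
    return paths[0][0][1] if paths else "unknown"

answer = path_of_thoughts(
    "A is west of B. Where is A relative to B?",
    toy_extract, toy_find_paths, toy_reason,
)
print(answer)
```

Passing the stages in as callables mirrors the summary's point that the reasoner can be either an LLM or a symbolic solver without changing the rest of the flow.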

System Modules

Graph Extractor

Extract entities, relations, and the specific query from the text story into a structured format.

Model or implementation: LLM (e.g., GPT-4, Llama-3-8B-Instruct)
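
As a rough illustration of what "a structured format" could look like, the sketch below parses a JSON reply into edge triples plus the query pair. The JSON schema (`edges`, `query`) and the function name `parse_extraction` are assumptions made for this example; the summary does not specify the output format used in the paper.

```python
import json
from typing import List, Tuple

Edge = Tuple[str, str, str]   # (head entity, relation, tail entity)

def parse_extraction(raw: str) -> Tuple[List[Edge], Tuple[str, str]]:
    """Turn a JSON reply (assumed schema) into edges and the query pair."""
    data = json.loads(raw)
    edges: List[Edge] = []
    for item in data.get("edges", []):
        # Skip malformed triples instead of failing the whole extraction.
        if (isinstance(item, list) and len(item) == 3
                and all(isinstance(x, str) for x in item)):
            head, relation, tail = (x.strip() for x in item)
            edges.append((head, relation, tail))
    source, target = data["query"]
    return edges, (source.strip(), target.strip())

# A hypothetical LLM reply for the story "A is west of B. C is north of A."
raw_output = (
    '{"edges": [["A", "west_of", "B"], ["C", "north_of", "A"], ["bad"]],'
    ' "query": ["C", "B"]}'
)
edges, query = parse_extraction(raw_output)
print(edges, query)
```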

Path Finder

Identify all valid sequences of relations connecting the source and target entities.

Model or implementation: Standard Path-finding Algorithm (e.g., DFS/BFS)
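
A minimal sketch of this step, assuming a depth-first search over simple paths (no repeated nodes). Edges are traversed in both directions, and each step records whether the edge was followed head-to-tail so a later reasoner can invert the relation if needed. The function name, the `max_hops` cap, and the step format are illustrative assumptions.

```python
from collections import defaultdict
from typing import List, Tuple

Edge = Tuple[str, str, str]   # (head entity, relation, tail entity)
Step = Tuple[Edge, bool]      # (edge, True if traversed head -> tail)

def find_all_paths(edges: List[Edge], source: str, target: str,
                   max_hops: int = 10) -> List[List[Step]]:
    """Enumerate every simple path from source to target via DFS."""
    adjacency = defaultdict(list)
    for head, relation, tail in edges:
        edge = (head, relation, tail)
        adjacency[head].append((tail, edge, True))
        adjacency[tail].append((head, edge, False))

    paths: List[List[Step]] = []

    def dfs(node: str, visited: set, path: List[Step]) -> None:
        if node == target:
            paths.append(list(path))
            return
        if len(path) >= max_hops:
            return
        for neighbor, edge, forward in adjacency[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            path.append((edge, forward))
            dfs(neighbor, visited, path)
            path.pop()
            visited.remove(neighbor)

    dfs(source, {source}, [])
    return paths

edges = [("A", "west_of", "B"), ("C", "north_of", "A"), ("C", "northwest_of", "B")]
paths = find_all_paths(edges, "C", "B")
for p in paths:
    print(p)
```

In this toy graph the query pair (C, B) is connected both by a two-hop path through A and by a direct edge, which is the kind of redundancy that multi-path validation relies on.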

Reasoner

Infer the final relationship based on the identified paths.

Model or implementation: LLM (PoT-LLM) OR Symbolic Solver (PoT-Symbolic / CLINGO)
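
To make the symbolic option concrete, here is a toy reasoner for compass-style spatial relations that solves each path by summing direction offsets and then takes a majority vote across paths. It is a deliberately simplified stand-in: the paper's symbolic variant uses ASP rules, whereas the offset table, the sign-based composition, and the voting rule below are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

Edge = Tuple[str, str, str]   # (head entity, relation, tail entity)
Step = Tuple[Edge, bool]      # (edge, True if traversed head -> tail)

# Direction of the head entity relative to the tail entity.
NAMES = {
    (-1, 0): "west_of", (1, 0): "east_of",
    (0, 1): "north_of", (0, -1): "south_of",
    (-1, 1): "northwest_of", (1, 1): "northeast_of",
    (-1, -1): "southwest_of", (1, -1): "southeast_of",
    (0, 0): "overlap",
}
OFFSETS = {name: vec for vec, name in NAMES.items() if name != "overlap"}

def sign(value: int) -> int:
    return (value > 0) - (value < 0)

def solve_path(path: List[Step]) -> str:
    """Compose the relations along one path into a single direction."""
    dx = dy = 0
    for (_, relation, _), forward in path:
        ox, oy = OFFSETS[relation]
        if not forward:            # edge walked tail -> head: invert it
            ox, oy = -ox, -oy
        dx, dy = dx + ox, dy + oy
    return NAMES[(sign(dx), sign(dy))]

def aggregate(paths: List[List[Step]]) -> str:
    """Majority vote over per-path answers; 'unknown' if no path exists."""
    votes = Counter(solve_path(p) for p in paths)
    return votes.most_common(1)[0][0] if votes else "unknown"

paths = [
    [(("C", "north_of", "A"), True), (("A", "west_of", "B"), True)],
    [(("C", "north_of", "B"), True)],        # a noisy extracted edge
    [(("B", "southeast_of", "C"), False)],
]
predicted = aggregate(paths)
print(predicted)
```

Two of the three paths agree, so the noisy direct edge is outvoted. Collapsing offsets to their sign ignores distances, so this composition is only a coarse approximation.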

Novel Architectural Elements

Three-stage decoupling: Extraction -> Path Finding -> Reasoning
Path-centric context filtering: Only the specific paths relevant to the query are sent to the reasoning module, removing noise.
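
The filtering idea can be illustrated by building the reasoner's prompt from the identified paths only, so that facts outside those paths never reach the LLM. The prompt wording and the function name `build_reasoning_prompt` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str]   # (head entity, relation, tail entity)

def build_reasoning_prompt(paths: List[List[Edge]], source: str, target: str) -> str:
    """Render only the edges that lie on a path between the query entities."""
    lines = ["Facts relevant to the question, grouped by reasoning path:"]
    for index, path in enumerate(paths, start=1):
        chain = "; ".join(
            f"{head} {relation.replace('_', ' ')} {tail}"
            for head, relation, tail in path
        )
        lines.append(f"Path {index}: {chain}")
    lines.append(f"Question: what is the relation of {source} to {target}?")
    return "\n".join(lines)

# The full story also states "D south_of E", which lies on no path from C to B.
story_edges = [("A", "west_of", "B"), ("C", "north_of", "A"), ("D", "south_of", "E")]
paths = [[("C", "north_of", "A"), ("A", "west_of", "B")]]
prompt = build_reasoning_prompt(paths, "C", "B")
print(prompt)
```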

Modeling

Base Model: Evaluated with multiple backbones: GPT-3.5-Turbo, GPT-4, Llama-3-8B-Instruct, Llama-3-70B-Instruct, Mistral-7B-Instruct

Comparison to Prior Work

vs. LLM-ASP: PoT adds a 'Path Identification' layer that filters noise and allows multiple reasoning paths, making it more robust to extraction errors.
vs. CoT/CoT-SC: PoT explicitly structures the context as a graph and filters irrelevant information before reasoning.
vs. CoS: PoT explicitly extracts the graph first, rather than asking the LLM to translate and reason simultaneously.

Limitations

Dependency on the initial extraction quality; if the graph is completely disconnected or missing key edges, reasoning fails.
Symbolic solver variants require domain-specific rule definitions (ASP modules), which are hard to write for complex domains like Chinese kinship (>500 relations).
For very simple tasks, graph extraction may add more overhead than plain input-output (IO) prompting.

📊 Experiments & Results

Evaluation Setup

Relational reasoning on text stories requiring multi-hop inference.

Benchmarks:

StepGame (Spatial reasoning)
CLUTRR (Kinship reasoning)
SPARTUN (Topological spatial reasoning)
Chinese Kinship (Kinship reasoning) [New]

Metrics:

Accuracy (checking if predicted relation matches ground truth)
Statistical methodology: Not explicitly reported in the paper
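
For reference, the accuracy metric amounts to an exact-match rate. The sketch below assumes predictions and gold labels are plain strings and normalizes case and surrounding whitespace before comparing; that normalization is an assumption, since the summary does not describe how matches are checked.

```python
from typing import Sequence

def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the gold relation."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, gold)
    )
    return 100.0 * correct / len(gold)

# Made-up labels purely to exercise the function.
score = accuracy(["aunt", "Uncle ", "son", "niece"],
                 ["aunt", "uncle", "daughter", "niece"])
print(score)
```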

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
StepGame (k=10)	Accuracy	48.3	69.6	+21.3
StepGame (k=10)	Accuracy	65.1	69.6	+4.5
CLUTRR	Accuracy	64.3	78.4	+14.1
Chinese Kinship	Accuracy	70.0	83.6	+13.6

StepGame (Spatial Reasoning): significant gains over baselines, especially at higher hop counts (k=10).
CLUTRR (Kinship Reasoning): results demonstrate robustness.
Chinese Kinship: results on the newly constructed dataset, which is highly complex.

Experiment Figures

Comparison of different methods (Standard, CoT, PoT) on StepGame as the number of hops (k) increases.

Main Takeaways

Graph-based filtering significantly aids LLMs: By extracting paths and removing irrelevant context, PoT allows even the LLM-based reasoner (PoT-LLM) to outperform standard CoT.
Neuro-symbolic synergy: PoT-Symbolic generally performs best on tasks where logical rules are easy to define (e.g., StepGame, CLUTRR), validating the hybrid approach.
Robustness to reasoning depth: The performance gap between PoT and baselines widens as the number of reasoning hops increases (e.g., StepGame k=10), suggesting that PoT handles long multi-hop stories better.

📚 Prerequisite Knowledge

Prerequisites

Basic graph theory (nodes, edges, paths)
Prompt engineering techniques (CoT, Few-shot)
Answer Set Programming (ASP) concepts (for the symbolic solver variant)

Key Terms

ASP: Answer Set Programming—a declarative programming paradigm oriented towards complex combinatorial search problems, used here as a symbolic solver.

CLUTRR: A diagnostic benchmark dataset for testing the ability of systems to learn kinship reasoning rules from examples.

Neuro-symbolic: AI systems combining neural networks (like LLMs) with symbolic reasoning (logic, graphs) to improve robustness and interpretability.

CoT: Chain-of-Thought—a prompting strategy where the model is encouraged to generate intermediate reasoning steps.

Graph Extraction: The process of converting unstructured text into a structured graph representation with entities as nodes and relations as edges.

Path Identification: Finding sequences of edges (relations) connecting two specific nodes (entities) in a graph.