Follow the Path: Reasoning over Knowledge Graph Paths to Improve LLM Factuality

📝 Paper Summary

Factuality Reasoning

The authors fine-tune small language models on reasoning traces distilled from large models but grounded in verifiably correct Knowledge Graph paths, significantly improving factuality in multi-hop question answering.

Core Problem

Large reasoning models generate 'thinking' traces that improve performance but may contain factual hallucinations, and smaller models struggle to perform complex multi-hop reasoning reliably.

Why it matters:

Factual consistency is mandatory for critical real-world applications
Distilled reasoning traces from larger models often lack a verification mechanism, potentially propagating hallucinations to smaller student models
Current reasoning techniques prioritize problem-solving logic (like math) over verifiable factual accuracy in open-domain QA

Concrete Example: A reasoning model might answer 'Pablo Picasso' correctly yet hallucinate intermediate reasoning steps about his art-movement associations. Without grounding, a student model fine-tuned on that trace learns the flawed logic.

Key Novelty

fs1 (Factual Simple Test-time Scaling)

Extract reasoning traces from large models (e.g., DeepSeek-R1) for complex questions.
Condition these traces on linearized Knowledge Graph paths retrieved from Wikidata to enforce factual accuracy in the reasoning steps.
Fine-tune smaller standard LLMs (e.g., Qwen2.5-Instruct) on these verifiable, grounded traces to induce reliable reasoning capabilities.

Architecture

Conceptual overview of the fs1 method: extraction of raw traces, grounding with KG paths, and fine-tuning student models.

Evaluation Highlights

Improves pass@16 by 6 to 14 absolute points on Qwen2.5-32B across six benchmarks compared to standard instruction-tuned baselines.
Smaller models (e.g., 0.5B) show large relative gains from grounded fine-tuning (up to +74.6% on WebQSP), while larger models see diminishing returns.
Outperforms baselines particularly on complex questions requiring 3+ hops of reasoning and on numerical answer types.

Breakthrough Assessment

7/10

Strong empirical results demonstrating that KG grounding effectively cleans reasoning traces for distillation. While the method combines existing components (distillation + KGs), the analysis of scaling and complexity is valuable.

⚙️ Technical Details

Problem Definition

Setting: Multi-hop Question Answering (mQA) where a model must synthesize evidence from multiple sources.

Inputs: Natural language question x

Outputs: A reasoning trace and a final answer y; the answer should match the gold-standard entity.

Pipeline Flow

Reasoning Trace Extraction (Teacher Models)
Knowledge Graph Grounding (Path Retrieval)
Trace Refinement (Teacher Models + KG Paths)
Fine-tuning (Student Models)
Inference (Parallel Sampling)

System Modules

Teacher Reasoning (Data Generation)

Generate initial reasoning traces for complex questions

Model or implementation: DeepSeek-R1 (671B) and QwQ-32B

KG Path Retriever (Data Generation)

Find verifiable paths connecting question entities to answers

Model or implementation: SPARQL engine over Wikidata

Trace Refiner (Data Generation)

Regenerate reasoning traces conditioned on factual KG paths

Model or implementation: DeepSeek-R1 and QwQ-32B

Student Model

Answer new questions using learned reasoning patterns

Model or implementation: Qwen2.5-Instruct (0.5B to 32B)

Novel Architectural Elements

Conditioning reasoning trace generation on linearized KG paths during the distillation phase to create a 'factual' training set
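To make "linearized KG paths" concrete, the sketch below turns a chain of (subject, relation, object) triples into one line of text and prepends it to a teacher prompt. The arrow format, the helper names `linearize_path` and `build_grounded_prompt`, the prompt wording, and the example triples are all illustrative assumptions, not the paper's actual template or real Wikidata output.

```python
from typing import List, Tuple

# (subject label, relation label, object label); labels are assumed to be
# human-readable strings already resolved from KG identifiers.
Triple = Tuple[str, str, str]


def linearize_path(path: List[Triple]) -> str:
    """Render a chain of triples as 'A --rel--> B --rel--> C' (hypothetical format)."""
    if not path:
        return ""
    parts = [path[0][0]]  # start from the first subject
    for _subj, rel, obj in path:
        parts.append(f"--{rel}--> {obj}")
    return " ".join(parts)


def build_grounded_prompt(question: str, paths: List[List[Triple]]) -> str:
    """Prepend linearized paths to the question (hypothetical prompt wording)."""
    facts = "\n".join(f"- {linearize_path(p)}" for p in paths if p)
    return f"Known facts from the knowledge graph:\n{facts}\n\nQuestion: {question}"


# Made-up example path, for illustration only.
example_path = [
    ("Pablo Picasso", "movement", "Cubism"),
    ("Cubism", "country of origin", "France"),
]
print(build_grounded_prompt("Where did Picasso's art movement originate?", [example_path]))
```

A flat string like this keeps the grounding inside the teacher's ordinary text context, so no change to the model architecture is needed.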

Modeling

Base Model: Qwen2.5-Instruct (sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B)

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Minimize negative log-likelihood of target tokens.

Formally: standard cross-entropy loss on the reasoning trace and final answer.
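Written out, the objective takes the usual autoregressive form. The symbols are assumptions introduced here for illustration: $x$ is the question, $z_{1:T}$ is the target token sequence (the reasoning trace followed by the final answer), and $\theta$ denotes the student model's parameters.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(z_t \mid x, z_{<t}\right)
```

The sum runs over target tokens only, matching the description above; the question tokens serve as conditioning context.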

Training Data:

3.4K raw traces (rt) from CWQ dev set
3.9K KG-enhanced traces (fs1) from CWQ dev set
Only correct traces retained
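The "only correct traces retained" step can be pictured as a simple filter over generated traces. The sketch below assumes correctness is decided by normalized string match against the gold answers; the paper may use a different check (for example an LLM judge), and the field names `predicted_answer` and `gold_answers` are hypothetical.

```python
import re
import string
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def keep_correct(traces: List[Dict]) -> List[Dict]:
    """Keep traces whose predicted answer matches any gold answer after normalization."""
    kept = []
    for trace in traces:
        prediction = normalize(trace["predicted_answer"])
        golds = {normalize(g) for g in trace["gold_answers"]}
        if prediction in golds:
            kept.append(trace)
    return kept


# Toy records with hypothetical field names.
traces = [
    {"predicted_answer": "Pablo Picasso.", "gold_answers": ["Pablo Picasso"]},
    {"predicted_answer": "Georges Braque", "gold_answers": ["Pablo Picasso"]},
]
print(len(keep_correct(traces)))
```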

Key Hyperparameters:

epochs: 5
sequence_length: 8192
batch_size: 16
learning_rate: 1e-5
lr_schedule: cosine (5% warmup)
weight_decay: 1e-4
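The learning-rate settings above describe a cosine schedule with a 5% warmup. A minimal sketch follows, assuming linear warmup to the peak rate and cosine decay to zero; the exact warmup shape and final rate are not stated here. The step count of 1,220 is also an assumption, derived from roughly 3.9K examples at batch size 16 for 5 epochs with no extra gradient accumulation.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_frac: float = 0.05) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


total_steps = 1220  # assumed: ceil(3900 / 16) steps per epoch * 5 epochs
for step in (0, 30, 61, 610, 1219):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```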

Compute: Not reported in the paper

Comparison to Prior Work

vs. KG-RAG: fs1 internalizes the reasoning capability via fine-tuning rather than relying solely on retrieval at inference time
vs. Simple Test-Time Scaling: fs1 explicitly grounds the 'thinking' process in verifiable facts during training, rather than just scaling up ungrounded reasoning
vs. RoG: fs1 fine-tunes general-purpose instruction-tuned models on distilled reasoning traces, whereas RoG typically focuses on path generation models
vs. Step-Back Prompting [not cited in paper]: fs1 uses structural KG paths for grounding, whereas Step-Back uses abstract principles

Limitations

Improvement is scale-dependent; larger models (32B) show less relative gain than smaller models (0.5B) from KG grounding.
Requires existence of relevant paths in the Knowledge Graph; performance depends on KG coverage.
KG path extraction relies on knowing the gold answer during training data creation (not inference), limiting data scalability to labeled sets.
Inference cost increases due to the generation of reasoning traces compared to direct answering.

Reproducibility

Code: https://github.com/jjzha/fs1

The code, the 3.4K raw traces, the 3.9K KG-enhanced traces, and the models are publicly available at https://github.com/jjzha/fs1. Hardware details for training are not explicitly reported in the main text.

📊 Experiments & Results

Evaluation Setup

Multi-hop Question Answering (mQA) across open-domain benchmarks.

Benchmarks:

ComplexWebQuestions (CWQ) (Complex mQA)
Mintaka (Complex mQA)
WebQSP (mQA)
GrailQA (mQA)
ExaQT (mQA)
SimpleQA (Factuality benchmark)

Metrics:

pass@k (k=1, 2, 4, 8, 16)
LLM-as-a-Judge Accuracy (Llama-3.3-70B)
Statistical methodology: Not explicitly reported in the paper
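For reference, pass@k is often computed with the unbiased estimator popularized in code-generation evaluation: given n samples for a question, of which c are correct, it estimates the chance that a random subset of k samples contains at least one correct answer. Whether the paper uses exactly this estimator is an assumption; with n = k = 16 it reduces to checking whether any of the 16 samples is correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one question: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average per-question pass@k over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy data: number of correct samples (out of 16) for four questions.
print(benchmark_pass_at_k([0, 1, 4, 16], n=16, k=1))
print(benchmark_pass_at_k([0, 1, 4, 16], n=16, k=16))
```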

Key Results

Parallel scaling (pass@16) results for Qwen2.5-32B showing fs1's superiority over baselines.

Benchmark	Metric	Baseline	This Paper	Δ
CWQ	pass@16	0.38	0.54	+0.16
SimpleQA	pass@16	0.14	0.20	+0.06

Single-pass (pass@1) accuracy comparisons across model scales showing smaller models benefit most.

Benchmark	Metric	Baseline	This Paper	Δ
WebQSP	Accuracy (LLM-Judge)	0.181	0.316	+0.135
CWQ	Accuracy (LLM-Judge)	0.135	0.209	+0.074
CWQ	Accuracy (LLM-Judge)	0.452	0.463	+0.011

Baseline comparisons show o3-mini dominates, but fs1 improves open-weights models.

Benchmark	Metric	Baseline	This Paper	Δ
Mintaka	Accuracy (LLM-Judge)	0.774	0.428	-0.346
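The Δ column reports absolute differences, while the summary elsewhere quotes relative gains. The short sketch below shows how the two relate, using the single-pass rows from the table. The function names are illustrative, and the rows carry no model-size labels here because the table does not give them.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in score, in the same units as the metric."""
    return new - baseline


def relative_gain_pct(baseline: float, new: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    return 100.0 * (new - baseline) / baseline


# (benchmark, baseline, this paper) copied from the single-pass rows.
rows = [
    ("WebQSP", 0.181, 0.316),
    ("CWQ", 0.135, 0.209),
    ("CWQ", 0.452, 0.463),
]
for name, base, new in rows:
    print(f"{name}: {absolute_gain(base, new):+.3f} absolute, "
          f"{relative_gain_pct(base, new):+.1f}% relative")
```

The WebQSP row works out to about +74.6% relative, which matches the figure quoted for WebQSP in the evaluation highlights. The same arithmetic shows why a small absolute gain on a high baseline looks like diminishing returns.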

Experiment Figures

Pass@k performance curves (k=1 to 16) for Qwen2.5-32B on six benchmarks.

Breakdown of model performance by question difficulty (hops), answer type, and domain.

Main Takeaways

Grounding reasoning traces with Knowledge Graph paths (fs1) consistently improves factual accuracy, especially when using parallel sampling (test-time scaling).
Smaller models (0.5B parameters) gain significantly more from this fine-tuning than larger models (32B), likely because larger models already possess strong internal parametric knowledge.
The method is particularly effective for 'hard' questions requiring 3 or more reasoning hops and for numerical answer types.
fs1-tuned models are more robust on specific domains like video games, geography, and politics compared to standard chain-of-thought.

📚 Prerequisite Knowledge

Prerequisites

Language Model Distillation
Knowledge Graphs (SPARQL, Entities, Relations)
Chain-of-Thought Reasoning
Test-Time Scaling (Best-of-N sampling)

Key Terms

fs1: Factual Simple Test-time Scaling—the proposed method of fine-tuning models on reasoning traces grounded by Knowledge Graph paths.

rt: Raw Reasoning Traces—traces extracted directly from large reasoning models (like DeepSeek-R1) without external grounding.

pass@k: A metric measuring the probability that at least one correct answer exists in k generated samples.

KG path: A sequence of entities and relations from a Knowledge Graph (e.g., Wikidata) connecting the question subject to the answer.

test-time scaling: Improving model performance by increasing computation during inference, often by generating multiple samples and selecting the best one.

LLM-as-a-judge: Using a strong LLM (e.g., Llama-3.3-70B) to evaluate whether a generated answer is semantically equivalent to the gold standard.

SFT: Supervised Fine-Tuning—training a pre-trained model on a specific dataset of inputs and targets.

linearized graph: Representing a graph structure (nodes and edges) as a sequence of text tokens so an LLM can process it.