Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering

📝 Paper Summary

Temporal Knowledge Graph Question Answering (TKGQA), Retrieval-Augmented Generation (RAG)

PoK improves temporal question answering by decomposing complex queries into structured sub-objectives (Retrieve, Rank, Reason) and retrieving facts using a contrastive time-aware embedding model.

Core Problem

LLMs struggle with complex multi-hop temporal reasoning, often suffering from hallucinations due to implicit temporal constraints and a lack of specific temporal knowledge.

Why it matters:

Standard RAG methods prioritize semantic similarity but overlook temporal consistency (e.g., retrieving facts from the wrong year)
Path-based reasoning on Temporal Knowledge Graphs is computationally complex and often fails to find valid paths for multi-hop queries
LLMs frequently generate factually incorrect timelines or confuse temporal order (e.g., 'before' vs. 'after') when reasoning implicitly

Concrete Example: For the question 'After the Danish Ministry, who was the first to visit Iraq?', standard Chain-of-Thought might hallucinate a visit date or person. Text-based RAG might retrieve a visit from 2016 when the question implies a sequence starting in 2003, failing to respect the 'after' constraint.

Key Novelty

Plan of Knowledge (PoK) Framework

Decomposes questions into executable sub-objectives using three specific operators: Retrieve (fetch facts), Rank (sort chronologically), and Reason (infer answer)
Constructs a Temporal Knowledge Store (TKS) where facts are encoded as text with time-aware contrastive learning to align questions with temporally relevant facts
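
One way to picture such a decomposition is as a short list of typed steps. The following is a minimal sketch, not the paper's plan format: the `Step` schema, the free-text step arguments, the rules checked by `validate_plan`, and the example plan itself are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Operator(Enum):
    """The three operators named in the PoK framework."""
    RETRIEVE = "Retrieve"
    RANK = "Rank"
    REASON = "Reason"


@dataclass(frozen=True)
class Step:
    """One sub-objective in a plan (hypothetical schema)."""
    op: Operator
    argument: str


def validate_plan(plan: list[Step]) -> bool:
    """Assumed sanity rules: non-empty, ends with Reason, no Rank before a Retrieve."""
    if not plan or plan[-1].op is not Operator.REASON:
        return False
    seen_retrieve = False
    for step in plan:
        if step.op is Operator.RETRIEVE:
            seen_retrieve = True
        elif step.op is Operator.RANK and not seen_retrieve:
            return False
    return True


# Hypothetical plan for: "After the Danish Ministry, who was the first to visit Iraq?"
example_plan = [
    Step(Operator.RETRIEVE, "time when the Danish Ministry visited Iraq"),
    Step(Operator.RETRIEVE, "visits to Iraq after that time"),
    Step(Operator.RANK, "sort the retrieved visits by time, ascending"),
    Step(Operator.REASON, "return the visitor in the earliest visit"),
]
```

Restricting plans to a closed set of operators makes them easy to check mechanically before any retrieval is run, which is the point of the `validate_plan` helper in this sketch.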

Architecture

The overall architecture of the PoK framework, detailing the pipeline from question input to answer generation.

Evaluation Highlights

Achieves 77.9% Hits@1 on the MultiTQ dataset, outperforming the previous state-of-the-art RTQA by 1.8% relative (76.5% to 77.9%, +1.4 points)
Surpasses GenTKGQA on the TimeQuestions dataset with 83.2% Hits@1 compared to 58.4%
Demonstrates large gains on complex questions in Timeline-ICEWS, improving Hits@1 from 37.6% (GPT-4o) to 68.3% (+30.7 points)

Breakthrough Assessment

7/10

Strong empirical results (+56% on some benchmarks) and a logical decomposition framework. The approach effectively bridges structured KG reasoning and LLM generation, though its novelty lies in how standard components (LLaMA/Qwen) are arranged.

⚙️ Technical Details

Problem Definition

Setting: Temporal Knowledge Graph Question Answering (TKGQA) using Retrieval-Augmented Generation

Inputs: Natural language question q and a Temporal Knowledge Graph G = {E, P, T, F} (entities, predicates, timestamps, facts)

Outputs: Answer a (entity or timestamp)

Pipeline Flow

Plan Generation: LLM decomposes question into sub-objectives (Retrieve, Rank, Reason)
Temporal Retrieval: Contrastive retriever fetches relevant facts from TKS based on sub-objectives
Re-ranking: Filter and sort facts based on semantic and temporal relevance
Reasoning: LLM generates final answer using the ordered evidence
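
The last three stages of this flow can be traced on toy data. The sketch below is only an illustration: exact matching stands in for the dense retriever, a rule stands in for the LLM reasoner, the plan is fixed by hand, and every fact, name, and year is an invented placeholder rather than dataset content.

```python
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    subject: str
    predicate: str
    obj: str
    year: int


def retrieve(facts, predicate, obj):
    """Stand-in for the dense retriever: exact match on predicate and object."""
    return [f for f in facts if f.predicate == predicate and f.obj == obj]


def rerank(facts, after_year):
    """Drop facts that violate the 'after' constraint, then sort chronologically."""
    valid = [f for f in facts if f.year > after_year]
    return sorted(valid, key=lambda f: f.year)


def reason(ordered_facts) -> Optional[str]:
    """Stand-in for the LLM reasoner: the 'first' visitor is the earliest fact."""
    return ordered_facts[0].subject if ordered_facts else None


# Invented toy facts (placeholders only).
TOY_FACTS = [
    Fact("Ministry_X", "visit", "Country_Y", 2003),
    Fact("Person_A", "visit", "Country_Y", 2016),
    Fact("Person_B", "visit", "Country_Y", 2004),
    Fact("Person_C", "visit", "Country_Y", 2001),
    Fact("Person_B", "criticize", "Country_Y", 2003),
]

# Hand-written plan: find the anchor visit, keep later visits, pick the earliest.
candidates = retrieve(TOY_FACTS, "visit", "Country_Y")
anchor_year = next(f.year for f in candidates if f.subject == "Ministry_X")
answer = reason(rerank(candidates, anchor_year))
print(answer)  # prints Person_B
```

Without the re-ranking step, the 2016 visit and the 2001 visit would both reach the reasoner with nothing marking them as out of order or out of range, which is the failure mode the pipeline is meant to avoid.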

System Modules

Planner

Decompose complex questions into executable steps

Model or implementation: GPT-4o

Temporal Retriever (Retrieval)

Retrieve semantically and temporally aligned facts

Model or implementation: Qwen3-Embedding-0.6B (Fine-tuned)

Re-ranker (Retrieval)

Refine retrieval by filtering invalid timeframes

Model or implementation: Algorithmic (Time-filtering function)

Reasoner

Generate final answer from plan and facts

Model or implementation: LLaMA2-Chat-7B (Fine-tuned)

Novel Architectural Elements

Plan of Knowledge (PoK) module that explicitly restricts LLM reasoning to three operators: Retrieve, Rank, and Reason
Dual-constraint retrieval scoring that linearly combines semantic similarity (vector dot product) with explicit temporal validity (time difference penalty)
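
A linear combination of this kind might look as follows. This is a hedged sketch rather than the paper's scoring function: the convex mix with weight `mu`, the `1 / (1 + gap)` form of the temporal term, and the year-level granularity are assumptions. Only the idea of mixing a dot-product similarity with a time-difference penalty, and the value 0.2 for the balance coefficient, come from the summary.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def temporal_score(query_year, fact_year):
    """Map a time gap to (0, 1]: 1.0 for an exact match, decaying with the gap."""
    return 1.0 / (1.0 + abs(query_year - fact_year))


def combined_score(q_vec, f_vec, query_year, fact_year, mu=0.2):
    """Linear mix of semantic and temporal terms; mu weights the temporal term."""
    semantic = dot(q_vec, f_vec)
    return (1.0 - mu) * semantic + mu * temporal_score(query_year, fact_year)


# Toy vectors (assumed): the first fact is semantically closer but 13 years off,
# the second is slightly less similar but matches the query year.
query_vec, query_year = [1.0, 0.0], 2003
score_wrong_year = combined_score(query_vec, [0.9, 0.1], query_year, 2016)
score_right_year = combined_score(query_vec, [0.8, 0.2], query_year, 2003)
```

In this toy case the temporally valid fact ends up ranked above the semantically closer one, which is the behaviour a purely semantic retriever cannot provide.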

Modeling

Base Model: LLaMA2-Chat-7B (Reasoning), Qwen3-Embedding-0.6B (Retrieval)

Training Method: Supervised Fine-Tuning (Reasoner) and Contrastive Fine-Tuning (Retriever)

Objective Functions:

Purpose: Optimize retriever to distinguish temporally correct facts from similar but incorrect ones.

Formally: InfoNCE loss minimizing distance between query and correct fact while maximizing distance to hard negatives (time/entity corrupted).

Purpose: Optimize reasoner to generate correct answers given evidence.

Formally: Standard causal language modeling loss maximizing likelihood of answer a given question q and facts f.
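
For a single query, the InfoNCE objective reduces to a softmax cross-entropy in which the positive fact is the target class. The sketch below computes it from precomputed similarity scores using only the standard library. The function name and the example similarity values are illustrative assumptions; the default temperature of 0.01 is the value listed under Key Hyperparameters.

```python
import math


def info_nce(sim_pos, sim_negs, tau=0.01):
    """InfoNCE for one query: negative log-softmax of the positive's similarity.

    sim_pos is the similarity between the question and the correct fact;
    sim_negs are the similarities to the hard negatives.
    """
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Assumed similarity values with three hard negatives per positive.
loss_separated = info_nce(0.9, [0.5, 0.4, 0.3])
loss_confused = info_nce(0.5, [0.5, 0.5, 0.5])
```

When the positive is indistinguishable from its three negatives the loss equals log 4, and it approaches zero as the positive pulls ahead. A small temperature such as 0.01 sharpens this contrast, so even modest similarity gaps are penalized strongly.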

Adaptation: Full fine-tuning (implied by 'Fine-tuned for 2 epochs')

Trainable Parameters: Soft prompt vectors P in retriever; Model weights in LLaMA2 reasoner

Training Data:

MultiTQ: 20% of training set used for fine-tuning due to size
Negatives for retrieval: 3 hard negatives per positive (time-incorrect, content-incorrect, both-incorrect)
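
The three kinds of hard negative can be produced by corrupting a positive fact along the time axis, the entity axis, or both. The following is a minimal sketch under stated assumptions: the paper's sampling procedure is not described in this summary, so the uniform random choice, the decision to corrupt the object entity, and the `Fact` type are all hypothetical.

```python
import random
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    predicate: str
    obj: str
    year: int


def make_hard_negatives(pos, entities, years, rng):
    """Return (time-incorrect, content-incorrect, both-incorrect) corruptions of pos."""
    wrong_year = rng.choice([y for y in years if y != pos.year])
    wrong_obj = rng.choice([e for e in entities if e not in (pos.subject, pos.obj)])
    return (
        pos._replace(year=wrong_year),                 # right event, wrong time
        pos._replace(obj=wrong_obj),                   # right time, wrong entity
        pos._replace(obj=wrong_obj, year=wrong_year),  # both wrong
    )


# Placeholder data; a seeded generator keeps the sketch reproducible.
positive = Fact("Person_B", "visit", "Country_Y", 2004)
negatives = make_hard_negatives(
    positive,
    entities=["Person_B", "Country_Y", "Country_Z", "Country_W"],
    years=[2001, 2003, 2004, 2016],
    rng=random.Random(0),
)
```

Each negative differs from the positive in exactly the dimension its label names, so the retriever cannot separate them on surface similarity alone.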

Key Hyperparameters:

epochs: 2
temperature_tau: 0.01 (contrastive loss)
balance_coefficient_mu: 0.2 (re-ranking)
retrieved_facts_k: 20

Compute: 2 NVIDIA A6000 GPUs

Comparison to Prior Work

vs. RTQA: PoK uses structured operators (Rank/Retrieve) rather than generic sub-questions, and explicitly fine-tunes the retriever for time awareness
vs. Naive RAG: PoK incorporates a temporal re-ranking step and time-aware contrastive embedding, whereas Naive RAG relies only on semantic similarity
vs. Chain-of-Thought (CoT) (discussed in the paper but not cited as a baseline): PoK separates planning from execution to prevent hallucination in intermediate steps, whereas CoT does both in one pass

Limitations

Retrieval of complex temporal facts remains challenging if facts are missing from the TKG (incomplete graph assumption)
Reliance on GPT-4o for the planning stage introduces a dependency on a closed-source model for the decomposition step
Fine-tuning requires specific hard negative construction which adds complexity to the data preparation pipeline

Reproducibility

Code availability is not provided. The paper describes the prompt templates for planning and reasoning. It specifies the base models (Qwen3-Embedding, LLaMA2-Chat-7B) and key hyperparameters.

📊 Experiments & Results

Evaluation Setup

TKGQA on four datasets with varying complexity and temporal granularity

Benchmarks:

MultiTQ (Large-scale TKGQA (multiple granularities))
TimeQuestions (Wikidata-based TKGQA (year-level))
Timeline-ICEWS (Complex temporal queries on ICEWS)
Timeline-CronQuestions (Complex temporal queries on CronQuestions)

Metrics:

Hits@1
Hits@10 (implied by context; the paper mainly reports Hits@1)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Main comparison on MultiTQ dataset showing PoK outperforming both traditional TKGQA methods and recent LLM-based approaches.
MultiTQ	Hits@1	76.5	77.9	+1.4
MultiTQ	Hits@1	70.2	77.9	+7.7
MultiTQ	Hits@1	37.9	77.9	+40.0
Results on TimeQuestions dataset, where PoK significantly outperforms graph-based and generative baselines.
TimeQuestions	Hits@1	58.4	83.2	+24.8
TimeQuestions	Hits@1	60.5	83.2	+22.7
Ablation studies validating the necessity of each component in the PoK framework.
MultiTQ	Hits@1	71.3	77.9	+6.6
MultiTQ	Hits@1	32.0	77.9	+45.9
Performance on complex Timeline datasets showing massive improvements over GPT-4o.
Timeline-ICEWS	Hits@1 (Complex)	37.6	68.3	+30.7

Experiment Figures

Radar charts comparing retrieval performance (Answer Coverage) of different retrievers on MultiTQ.

Line graphs showing Hits@1 performance and Answer Coverage as the number of retrieved facts increases.

Main Takeaways

Explicit planning (Retrieve/Rank/Reason) significantly reduces hallucination compared to direct generation or standard CoT.
Time-aware contrastive fine-tuning of the embedding model is crucial; standard semantic retrieval (Naive RAG) performs poorly on temporal questions.
LLMs like LLaMA-2 benefit immensely from fine-tuning on the reasoning task; zero-shot performance is very low (18.5% vs 77.9% with PoK on MultiTQ).
The 'Rank' operator is essential for ordinal questions (e.g., 'who was first'); removing it causes significant performance drops.

📚 Prerequisite Knowledge

Prerequisites

Knowledge of Temporal Knowledge Graphs (quadruples: subject, predicate, object, timestamp)
Retrieval-Augmented Generation (RAG) concepts
Contrastive Learning (InfoNCE loss)

Key Terms

TKGQA: Temporal Knowledge Graph Question Answering—answering questions that require reasoning about when events happened or the order of events

Quadruple: A data unit in a temporal knowledge graph consisting of (subject, predicate, object, timestamp)

Hits@1: A metric measuring the percentage of times the correct answer is the top-1 predicted output
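
As a concrete reading of this definition, the sketch below computes Hits@k over a batch of questions. It assumes each question has a ranked prediction list and a set of acceptable gold answers. The function name and the toy data are illustrative, and the paper's exact matching rules (for example, string normalization) are not covered.

```python
def hits_at_k(ranked_predictions, gold_answers, k=1):
    """Fraction of questions whose top-k predictions contain a gold answer."""
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align")
    if not ranked_predictions:
        return 0.0
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if any(p in gold for p in preds[:k])
    )
    return hits / len(ranked_predictions)


# Toy example: three questions, each with a ranked list of predictions.
preds = [["a", "b"], ["c", "d"], ["e"]]
gold = [{"a"}, {"d"}, {"x"}]
top1 = hits_at_k(preds, gold, k=1)
top2 = hits_at_k(preds, gold, k=2)
```

Here only the first question is answered correctly at rank 1, while the second is recovered once the second-ranked prediction is allowed.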

InfoNCE: A contrastive loss function used to maximize similarity between positive pairs (question and correct fact) while minimizing similarity with negative pairs

Hard Negatives: Incorrect training examples that are very similar to the truth (e.g., same event but wrong year) to force the model to learn fine-grained distinctions

TKS: Temporal Knowledge Store—a dense vector index of all temporal facts converted into text templates
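
Converting facts into text might look like the sketch below, which verbalizes quadruples before they are embedded. The sentence template, the underscore handling, and the `build_store` helper are hypothetical, since the paper's actual templates are not reproduced in this summary. The embedding step is omitted.

```python
def verbalize(subject, predicate, obj, timestamp):
    """Render a (subject, predicate, object, timestamp) quadruple as a sentence."""
    readable_predicate = predicate.replace("_", " ")
    return f"{subject} {readable_predicate} {obj} on {timestamp}."


def build_store(quadruples):
    """Pair each verbalized fact with its source quadruple for later lookup."""
    return [(verbalize(*quad), quad) for quad in quadruples]


# Placeholder quadruples, not dataset content.
store = build_store([
    ("Person_B", "make_a_visit_to", "Country_Y", "2004-05-01"),
    ("Ministry_X", "make_a_visit_to", "Country_Y", "2003-03-12"),
])
```

Keeping the original quadruple next to its text lets a later step read the timestamp directly instead of parsing it back out of the sentence.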

Chain-of-Thought: A prompting technique where the model produces intermediate reasoning steps before the final answer