TimeR4: Time-aware Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering

📝 Paper Summary

Graph-based RAG pipeline

TimeR4 enhances LLMs for temporal question answering by rewriting implicit questions into explicit ones with retrieved facts and using a time-aware retrieval-rerank pipeline.

Core Problem

LLMs struggle with temporal reasoning because they hallucinate on implicit time questions (e.g., 'after the ministry') and standard retrievers miss time constraints.

Why it matters:

Standard retrieval methods (like BM25) focus only on semantic matching, often retrieving facts with incorrect timestamps that mislead the LLM
Implicit temporal questions (lacking specific dates) cause severe hallucinations in LLMs, which cannot infer the hidden timeline without external knowledge
Existing TKGQA (Temporal Knowledge Graph Question Answering) methods using graph embeddings fail to handle the complex semantic nuance of natural language questions

Concrete Example: For the question 'After the Danish Ministry, who was the first to visit Iraq?', an LLM might guess incorrectly because the question never states when the Danish Ministry's visit took place. TimeR4 retrieves the fact (Danish Ministry, visit, Iraq, 2016-01-05), rewrites the question to 'After 2016-01-05...', and then retrieves the correct answer 'Jack Straw' (visit date 2016-01-06) while filtering out visits that violate the time constraint, such as Evan Bayh's on 2016-01-04.

Key Novelty

Retrieve-Rewrite-Retrieve-Rerank Framework (TimeR4)

Rewrites implicit temporal questions by retrieving background facts (e.g., event dates) and asking an LLM to substitute them into the query as explicit timestamps
Trains a specific Time-Aware Retriever using contrastive learning with negatives that have perturbed timestamps, ensuring the embedding model learns time sensitivity
Applies a hard temporal filter during reranking to explicitly penalize retrieved facts that violate the question's time constraints (e.g., filtering events before a 'start' date)

Architecture

The four-module architecture of TimeR4: Fact Retrieval -> Rewriting -> Time-aware Retrieval -> Reasoning

Evaluation Highlights

+47.8% improvement in Hits@1 on the MultiTQ dataset compared to the ChatGPT-based baseline ARI
+22.5% relative improvement on TimeQuestions dataset compared to the best baseline TwiRGCN
Achieves 72.8% Hits@1 on MultiTQ, significantly outperforming LLaMA2 (18.5%) and ChatGPT (10.2%) in zero-shot settings

Breakthrough Assessment

8/10

Significant performance jumps on standard TKGQA benchmarks. Effectively addresses the specific 'implicit time' problem in RAG, though the scope is limited to structured temporal knowledge graphs.

⚙️ Technical Details

Problem Definition

Setting: Temporal Knowledge Graph Question Answering (TKGQA) over a graph G={E,P,T,F}

Inputs: Natural language question q (potentially with implicit time constraints)

Outputs: Answer a (entity or timestamp) based on relevant quadruples f=(s,p,o,t)
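To make the quadruple format f=(s,p,o,t) concrete, here is a minimal Python sketch. The class name `Quadruple`, the helper `verbalize`, and the sentence template are illustrative assumptions, not taken from the paper; the three facts reuse the dates from the concrete example above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Quadruple:
    """One temporal fact f = (s, p, o, t). Field names are illustrative."""
    subject: str
    predicate: str
    obj: str
    timestamp: date


# Toy temporal knowledge graph built from the concrete example.
facts = [
    Quadruple("Evan Bayh", "visit", "Iraq", date(2016, 1, 4)),
    Quadruple("Danish Ministry", "visit", "Iraq", date(2016, 1, 5)),
    Quadruple("Jack Straw", "visit", "Iraq", date(2016, 1, 6)),
]


def verbalize(fact: Quadruple) -> str:
    """Turn a quadruple into a sentence that a text encoder could embed.

    The template is an assumption; the paper may verbalize facts differently.
    """
    return f"{fact.subject} {fact.predicate} {fact.obj} on {fact.timestamp.isoformat()}"
```

Storing the timestamp as a `date` rather than a string keeps later comparisons ("after 2016-01-05") simple and unambiguous.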

Pipeline Flow

Group 1: Query Processing: Question q → FKS Retrieval → Rewriting with explicit time q*
Group 2: Time-Aware Retrieval: q* → TKS Retrieval → Time-Filtering Rerank → Top-k Facts
Group 3: Reasoning: q* + Top-k Facts → LLM → Final Answer
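The three groups above can be read as a single function that chains five components. The sketch below is only a skeleton under that reading: `answer_question` and all parameter names are hypothetical, each stage is passed in as a callable so that no particular model or index is assumed, and the default `k=15` reflects the value the summary reports as roughly optimal.

```python
from typing import Callable, List


def answer_question(
    question: str,
    retrieve_background: Callable[[str], List[str]],   # FKS retrieval
    rewrite: Callable[[str, List[str]], str],          # implicit -> explicit time
    retrieve_time_aware: Callable[[str], List[str]],   # TKS retrieval
    rerank: Callable[[str, List[str]], List[str]],     # time-filtering rerank
    reason: Callable[[str, List[str]], str],           # reasoning LLM
    k: int = 15,
) -> str:
    """Retrieve -> rewrite -> retrieve -> rerank -> reason, as one call chain."""
    background = retrieve_background(question)
    explicit_question = rewrite(question, background)
    candidates = retrieve_time_aware(explicit_question)
    top_facts = rerank(explicit_question, candidates)[:k]
    return reason(explicit_question, top_facts)
```

Passing the stages as callables makes it easy to swap in stubs for testing, or to drop a stage (for example, an identity `rewrite`) when reproducing an ablation.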

System Modules

Fact Retriever (Query Processing)

Retrieve background facts to resolve implicit time references in the question

Model or implementation: SentenceBERT (standard)

Rewriter (Query Processing)

Convert implicit temporal questions into explicit ones using retrieved background facts

Model or implementation: GPT-3.5-turbo (via API)

Time-Aware Retriever (Time-Aware Retrieval)

Retrieve facts matching both semantics and time constraints

Model or implementation: SentenceBERT (fine-tuned via contrastive learning)

Reranker (Time-Aware Retrieval)

Filter and re-score facts based on hard time constraints

Model or implementation: Analytic function (Equation 9/10)
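The summary does not reproduce Equations 9 and 10, so the following is a stand-in rather than the paper's formula. It assumes the rewritten question yields an optional `after` and/or `before` date, scores each fact 1 or 0 on whether its timestamp satisfies that constraint, and blends this with the retriever's semantic similarity using the reported weight `mu = 0.4`. The names `ScoredFact`, `satisfies`, and `rerank`, and the linear blend itself, are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ScoredFact:
    text: str
    timestamp: date
    similarity: float  # semantic score from the retriever, assumed in [0, 1]


def satisfies(ts: date, after: Optional[date], before: Optional[date]) -> bool:
    """True if ts lies strictly inside the (after, before) window."""
    if after is not None and ts <= after:
        return False
    if before is not None and ts >= before:
        return False
    return True


def rerank(
    facts: List[ScoredFact],
    after: Optional[date] = None,
    before: Optional[date] = None,
    mu: float = 0.4,
) -> List[Tuple[float, ScoredFact]]:
    """Blend semantic similarity with a 0/1 time-constraint score (assumed form)."""
    scored = []
    for fact in facts:
        time_score = 1.0 if satisfies(fact.timestamp, after, before) else 0.0
        scored.append(((1.0 - mu) * fact.similarity + mu * time_score, fact))
    # Highest combined score first; sort on the score only.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

With the constraint `after=date(2016, 1, 5)`, a fact dated 2016-01-04 is pushed below a fact dated 2016-01-06 even when its semantic similarity is somewhat higher, which is the behavior the concrete example describes.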

Reasoning LLM

Generate the final answer using the refined context

Model or implementation: LLaMA2-Chat-7B (fine-tuned)

Novel Architectural Elements

Two-stage retrieval pipeline: First retrieving solely to rewrite the query (resolve implicit time), then retrieving again to answer
Dual Knowledge Stores (FKS vs TKS): Maintaining separate vector indices for semantic lookup (FKS) vs. time-aware lookup (TKS)

Modeling

Base Model: LLaMA2-Chat-7B (Reasoning), SentenceBERT (Retrieval)

Training Method: Supervised Fine-Tuning (LLM) + Contrastive Learning (Retriever)

Objective Functions:

Purpose: Train Time-Aware Retriever to distinguish correct facts from time-corrupted or entity-corrupted negatives.

Formally: Contrastive loss L = sum [w_p * Y * exp(phi) + w_n * (1-Y) * exp(1-phi)] where Y=1 for positive pairs and Y=0 for negative pairs, and w_p and w_n weight the positive and negative terms.

Purpose: Optimize Reasoning LLM to generate correct answers given retrieved context.

Formally: Standard causal language modeling loss maximizing P(a | q*, f+).
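The contrastive objective for the retriever, as written in this summary, can be evaluated directly. The sketch below takes the formula at face value and assumes that phi is a distance-like score in [0, 1] between the question and fact embeddings (small for similar pairs); the summary does not define phi, and this is simply the reading under which minimizing the loss pulls positives together and pushes negatives apart. The function name and default weights are illustrative.

```python
import math
from typing import Sequence


def contrastive_loss(
    phis: Sequence[float],
    labels: Sequence[int],
    w_p: float = 1.0,
    w_n: float = 1.0,
) -> float:
    """Sum of w_p*Y*exp(phi) + w_n*(1-Y)*exp(1-phi) over (question, fact) pairs.

    phis:   assumed distance-like scores in [0, 1], one per pair
    labels: Y = 1 for a positive pair, Y = 0 for a negative pair
    """
    if len(phis) != len(labels):
        raise ValueError("phis and labels must have the same length")
    total = 0.0
    for phi, y in zip(phis, labels):
        total += w_p * y * math.exp(phi) + w_n * (1 - y) * math.exp(1.0 - phi)
    return total


# A positive pair that is close (0.1) and a negative pair that is far (0.9)
# should cost less than the reverse arrangement.
good = contrastive_loss([0.1, 0.9], [1, 0])
bad = contrastive_loss([0.9, 0.1], [1, 0])
```

In an actual training loop the same expression would be written with tensor operations so that gradients flow back into the encoder; the plain-Python version only illustrates the arithmetic.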

Adaptation: Fine-tuning of LLaMA2 backbone

Trainable Parameters: Full fine-tuning (implied by 'fine-tune open-source LLMs')

Training Data:

MultiTQ: 386,787 train examples (used 20% for training)
TimeQuestions: 6,970 train examples
Negatives for Retriever: Generated by corrupting time, relations, or entities (3 types: time incorrect, content incorrect, both incorrect)
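The three negative types can be sketched as simple corruptions of a positive quadruple. This is an illustrative sampler, not the paper's procedure: the function name `make_negatives`, the 30-day shift range, and the choice to corrupt only the subject entity (rather than relations as well) are all assumptions.

```python
import random
from datetime import date, timedelta
from typing import Dict, List, Tuple

Fact = Tuple[str, str, str, date]  # (subject, predicate, object, timestamp)


def make_negatives(
    fact: Fact,
    entity_pool: List[str],
    rng: random.Random,
    max_shift_days: int = 30,
) -> Dict[str, Fact]:
    """Return one negative of each type: time, content, or both incorrect."""
    s, p, o, t = fact
    # Shift the timestamp by a non-zero number of days, in either direction.
    shift = rng.choice([-1, 1]) * rng.randint(1, max_shift_days)
    wrong_time = t + timedelta(days=shift)
    # Swap the subject for a different entity (pool must contain one).
    wrong_subject = rng.choice([e for e in entity_pool if e != s])
    return {
        "time_incorrect": (s, p, o, wrong_time),
        "content_incorrect": (wrong_subject, p, o, t),
        "both_incorrect": (wrong_subject, p, o, wrong_time),
    }


positive = ("Jack Straw", "visit", "Iraq", date(2016, 1, 6))
negatives = make_negatives(
    positive, ["Jack Straw", "Evan Bayh", "Danish Ministry"], random.Random(0)
)
```

The time-incorrect negative is the important one for time sensitivity: it is textually almost identical to the positive, so the encoder can only separate the two by attending to the timestamp.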

Key Hyperparameters:

retriever_epochs: 10
llm_epochs: 2
rerank_weight_mu: 0.4
negative_samples: 3 hard negatives per question

Compute: 2 NVIDIA A6000 GPUs

Comparison to Prior Work

vs. TempoQR/CronKGQA: TimeR4 uses LLMs for semantic reasoning and explicit rewriting rather than relying solely on graph embeddings
vs. ARI: TimeR4 fine-tunes a smaller local LLM (LLaMA2) with specialized retrieval rather than relying purely on prompting closed-source models
vs. General RAG [not cited in paper]: Adds explicit temporal rewriting and time-constraint filtering (Equation 9) which standard RAG lacks

Limitations

Requires ground truth answers and structured TKG facts, so it is not directly applicable to unstructured text corpora without a prior extraction step
Rewriting module relies on GPT-3.5 API, adding latency and cost compared to a fully local pipeline
Performance degrades if the number of retrieved facts is too high (>15) or too low
Evaluation metric issues: LLMs generate valid but non-standard formatted dates (e.g., 'May' vs '2012-05'), penalizing exact match scores

Reproducibility

Code: https://github.com/qianxinying/TimeR4

The rewriting step uses the OpenAI API (gpt-3.5-turbo), and the reasoning step uses LLaMA2-Chat-7B. The MultiTQ and TimeQuestions datasets are standard benchmarks.

📊 Experiments & Results

Evaluation Setup

TKGQA on two datasets with structured quadruples

Benchmarks:

MultiTQ: complex TKGQA with multiple time granularities
TimeQuestions: TKGQA with mostly year-level granularity

Metrics:

Hits@1
Hits@10 (standard in TKGQA evaluation, although the paper's tables mainly report Hits@1 or an equivalent accuracy)
Statistical methodology: Not explicitly reported in the paper

Key Results

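TimeR4 outperforms all baselines on the MultiTQ dataset, showing large gains over both embedding-based methods and general LLMs.

Benchmark	Metric	Baseline	This Paper	Δ
MultiTQ	Hits@1	38.0	72.8	+34.8
MultiTQ	Hits@1	18.5	72.8	+54.3

TimeR4 also leads on TimeQuestions, though the margin is smaller due to simpler time granularity.

Benchmark	Metric	Baseline	This Paper	Δ
TimeQuestions	Hits@1	60.5	78.1	+17.6

The remaining rows appear to be ablations on MultiTQ: the first value is the full TimeR4 model and the second is an ablated variant. The 11.66-point drop is consistent with the removal of question rewriting.

Benchmark	Metric	Full TimeR4	Ablated Variant	Δ
MultiTQ	Hits@1	72.78	41.04	-31.74
MultiTQ	Hits@1	72.78	61.12	-11.66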

Experiment Figures

Hits@1 performance vs. Number of Retrieved Facts (k) on both datasets

Venn diagrams comparing answer coverage of Time-aware Retriever vs Fact Retriever vs Ground Truth

Main Takeaways

Rewriting implicit temporal questions into explicit ones is critical; removing this step drops Hits@1 by ~11 points on MultiTQ
Time-aware retrieval (fine-tuned) combined with explicit time-filtering (rerank) outperforms standard semantic retrieval
Optimal number of retrieved facts is around 15; performance drops with fewer (lack of info) or more (noise)
General LLMs (ChatGPT/LLaMA2) perform poorly on TKGQA zero-shot due to hallucination and lack of specific temporal facts

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graph Question Answering (KGQA)
Contrastive Learning
In-Context Learning (ICL)

Key Terms

TKGQA: Temporal Knowledge Graph Question Answering—answering questions using facts stored as quadruples (subject, predicate, object, timestamp)

TKG: Temporal Knowledge Graph—a knowledge graph where every edge (fact) has an associated timestamp

FKS: Facts Knowledge Store—a vector index of all facts in the graph embedded using a standard language model

TKS: Temporal Knowledge Store—a vector index of all facts embedded using a fine-tuned time-aware encoder

Hits@1: A metric measuring the percentage of questions where the top-1 predicted answer is correct
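A minimal sketch of how Hits@1 can be computed, assuming each question has a ranked list of predicted answers and a set of acceptable gold answers. The function name and the exact-string matching rule are assumptions; exact matching is also what makes non-standard date formats count as misses.

```python
from typing import List, Set


def hits_at_1(predictions: List[List[str]], gold_answers: List[Set[str]]) -> float:
    """Fraction of questions whose top-ranked prediction is a gold answer."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(predictions, gold_answers)
        if ranked and ranked[0] in gold  # an empty prediction list is a miss
    )
    return hits / len(predictions)


score = hits_at_1(
    [["Jack Straw", "Evan Bayh"], ["May"], []],
    [{"Jack Straw"}, {"2012-05"}, {"Iraq"}],
)
```

In this toy input only the first question is a hit: 'May' does not exactly match '2012-05', and the third question has no prediction.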

SentenceBERT: A modification of the BERT network that uses siamese networks to derive semantically meaningful sentence embeddings

Contrastive Learning: A learning paradigm where the model learns to pull positive pairs closer and push negative pairs apart in vector space

Quadruple: A format for temporal facts: (subject, predicate, object, time)