Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models

📝 Paper Summary

Time-Sensitive Question Answering (TSQA), Hallucination Detection, Benchmarking

TDBench automates the creation of diverse time-sensitive QA pairs using temporal databases and SQL techniques to evaluate both answer accuracy and the validity of temporal reasoning in model explanations.

Core Problem

Existing Time-Sensitive Question Answering (TSQA) benchmarks rely on manual curation (costly, unscalable) or fixed templates (limited diversity), and often ignore whether the model's temporal reasoning explanation is actually correct.

Why it matters:

Facts evolve over time (e.g., presidents change), making static knowledge insufficient for reliability
LLMs frequently hallucinate explanations even when getting the final answer right, undermining trust
Current evaluation methods struggle to support application-specific data or complex multi-hop reasoning without heavy human labor

Concrete Example: A model correctly answers that 'Carl XVI Gustaf' is the current monarch of Sweden but explains that he has been monarch 'since 1974', a hallucinated date (he acceded in 1973) that standard answer-only metrics fail to detect.

Key Novelty

Database-Driven TSQA Benchmark Construction

Uses Temporal Functional Dependencies (TFDs) to automatically identify facts that are uniquely determined by time (e.g., Country + Role → Name)
Generates questions via temporal SQL queries covering all 13 temporal relations of Allen's interval algebra (e.g., 'meet', 'overlap', 'during') rather than hand-written templates
Evaluates 'Time Accuracy' by checking if the specific dates mentioned in the model's explanation satisfy the temporal constraints defined in the generated SQL

Architecture

The three-step pipeline of TDBench: (1) Knowledge Selection using TFDs, (2) Query Generation creating Temporal SQL, and (3) QA Construction converting SQL to Natural Language.

Evaluation Highlights

Detected hallucinations in explanations for 21.7% of correctly answered questions on average across 8 LLMs
Achieved 91.1% agreement with human verification for the automated time accuracy metric
Identified that LLMs struggle significantly more with complex temporal relations such as 'overlap' and 'meet' than with simple ones such as 'equal'

Breakthrough Assessment

8/10

Significantly advances TSQA by automating diverse question generation and introducing a reliable metric for verifying temporal reasoning, addressing a major blind spot in current answer-only evaluations.

⚙️ Technical Details

Problem Definition

Setting: Time-Sensitive Question Answering (TSQA) where models must provide both a correct final answer and valid temporal references in their explanation

Inputs: Natural language question q involving temporal constraints (e.g., 'Who was president 4 months before May 2019?')

Outputs: Answer a and explanation containing time references t
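
To make the temporal constraint in the example question concrete, the following minimal sketch resolves "4 months before May 2019" to a calendar month. The helper name `shift_months` is hypothetical and is not taken from the paper; month-level granularity is assumed.

```python
from datetime import date


def shift_months(d: date, months: int) -> date:
    """Return the first day of the month that lies `months` away from d's month.

    Negative values move backwards in time.
    """
    # Work with a zero-based running month index so that year boundaries
    # are handled by integer division.
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


reference = date(2019, 5, 1)            # "May 2019" from the example question
target = shift_months(reference, -4)    # "4 months before"
print(target.isoformat())               # 2019-01-01
```

The question therefore asks who held the role in January 2019, and a valid explanation should cite a term that covers that month.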

Pipeline Flow

Knowledge Selection (TFDs select attributes)
Query Generation (SQL queries created with temporal constraints)
Natural Language Conversion (LLM translates SQL to QA pairs)
Response Verification (Automated grading of answer and time references)

System Modules

Knowledge Selector (Construction)

Identify deterministic facts suitable for QA using Temporal Functional Dependencies

Model or implementation: Rule-based algorithm
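
As an illustration of what a rule-based TFD check might look like, the sketch below tests whether (Country, Role) → Name holds on a small table. The schema, the function name `tfd_violations`, and all rows are invented for illustration; the paper's actual selection algorithm is not reproduced in this summary.

```python
from collections import defaultdict
from typing import NamedTuple


class Row(NamedTuple):
    country: str
    role: str
    name: str
    start: int  # first year the fact holds (inclusive)
    end: int    # last year the fact holds (inclusive)


def tfd_violations(rows):
    """Return pairs of rows that share (country, role), overlap in time,
    and disagree on name. An empty result means the TFD holds."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row.country, row.role)].append(row)

    violations = []
    for group in groups.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                overlap = a.start <= b.end and b.start <= a.end
                if overlap and a.name != b.name:
                    violations.append((a, b))
    return violations


# Hypothetical data: the president role is unique at any time, the minister role is not.
rows = [
    Row("Freedonia", "president", "A. Alpha", 2000, 2007),
    Row("Freedonia", "president", "B. Beta", 2008, 2015),
    Row("Freedonia", "minister", "C. Gamma", 2000, 2010),
    Row("Freedonia", "minister", "D. Delta", 2005, 2012),
]
print(len(tfd_violations(rows)))  # 1
```

Under this reading, only attribute combinations with no violations would be passed on to question generation, since each such question has a single correct answer at any time point.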

GenQueries (Construction)

Construct temporal SQL queries covering 13 distinct temporal relations

Model or implementation: Algorithmic generation
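
The idea of generating one SQL query per temporal relation can be sketched with Python's built-in `sqlite3` module. The table `leader`, its columns, the predicate templates, and the rows below are all assumptions made for illustration; only five of the 13 relations are shown, and intervals are treated as half-open so that consecutive terms meet.

```python
import sqlite3

# WHERE-clause templates relating a row's interval to a query interval (:qs, :qe).
RELATION_PREDICATES = {
    "before":   "end_year < :qs",
    "meets":    "end_year = :qs",
    "overlaps": "start_year < :qs AND end_year > :qs AND end_year < :qe",
    "during":   "start_year > :qs AND end_year < :qe",
    "equal":    "start_year = :qs AND end_year = :qe",
}


def build_query(relation: str) -> str:
    """Assemble a parameterized SQL query for one temporal relation."""
    return (
        "SELECT name FROM leader "
        "WHERE country = :country AND role = :role AND "
        + RELATION_PREDICATES[relation]
        + " ORDER BY start_year"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leader (country TEXT, role TEXT, name TEXT, "
    "start_year INTEGER, end_year INTEGER)"
)
conn.executemany(
    "INSERT INTO leader VALUES (?, ?, ?, ?, ?)",
    [
        ("Freedonia", "president", "A. Alpha", 2000, 2008),
        ("Freedonia", "president", "B. Beta", 2008, 2016),
        ("Freedonia", "president", "C. Gamma", 2016, 2020),
    ],
)


def answers(relation: str, qs: int, qe: int):
    """Run the generated query and return the matching names."""
    params = {"country": "Freedonia", "role": "president", "qs": qs, "qe": qe}
    return [row[0] for row in conn.execute(build_query(relation), params)]


print(answers("meets", 2008, 2016))     # ['A. Alpha']
print(answers("equal", 2008, 2016))     # ['B. Beta']
print(answers("overlaps", 2006, 2012))  # ['A. Alpha']
```

Because the query result is computed by the database, the same predicate that generated the question can later serve as the ground truth for checking dates in a model's explanation.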

Translator (Construction)

Convert SQL queries into natural language questions

Model or implementation: GPT-4o

Judge

Verify if the extracted answer and time references match the ground truth

Model or implementation: LLM-based evaluator (model not specified, likely GPT-4o based on context)
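
The judge described here is LLM-based; the sketch below swaps in a much simpler regex-based check purely to illustrate the time-accuracy idea on the Swedish monarch example. The function names and the reduction of the constraint to a set of allowed years are assumptions, not the paper's method.

```python
import re

# Four-digit years from 1000 to 2099; a deliberately crude stand-in for date extraction.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def extract_years(text: str):
    """Return every year mentioned in the text, in order of appearance."""
    return [int(y) for y in YEAR.findall(text)]


def time_accurate(explanation: str, allowed_years) -> bool:
    """True if the explanation cites at least one year and every cited year
    is in the set permitted by the ground truth."""
    years = extract_years(explanation)
    return bool(years) and all(y in allowed_years for y in years)


# Ground truth supplied by hand here; in a real pipeline it would come from the database.
allowed = {1973}

hallucinated = "Carl XVI Gustaf has been monarch of Sweden since 1974."
grounded = "Carl XVI Gustaf has been monarch of Sweden since 1973."

print(time_accurate(hallucinated, allowed))  # False
print(time_accurate(grounded, allowed))      # True
```

Both responses give the same final answer, so an answer-only metric would score them identically; only the date check separates them.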

Novel Architectural Elements

Utilization of Temporal Functional Dependencies (TFDs) for systematic fact selection
Integration of SQL-based temporal constraints to automatically verify natural language explanations

Modeling

Base Model: Various (GPT-3.5, GPT-4, GPT-4o, Llama3.1-70B, Mixtral-8x7B, Gemma2-27B, Qwen2-72B, Granite3.1-8B)

Compute: Not reported in the paper

Comparison to Prior Work

vs. TempLAMA/DyKnow: TDBench uses temporal databases and TFDs for scalable generation rather than fixed templates
vs. TimeQA: TDBench automates question generation via SQL, reducing human labor and enabling application-specific data use
vs. SituatedQA [not cited in paper]: TDBench specifically targets fine-grained temporal relations (Allen's algebra) rather than general situation changes

Limitations

Reliance on the completeness and accuracy of the underlying temporal database
LLM-based translation from SQL to natural language may introduce minor errors (translation accuracy is reported as 91.5%)
Verification of time references depends on the judge model's ability to extract dates correctly from free text

Reproducibility

Code: https://github.com/ssoy0701/tdbench.git

Code and data are publicly available in the repository above. The paper details the algorithms for query generation and the prompts used for translation and evaluation.

📊 Experiments & Results

Evaluation Setup

Evaluated 8 LLMs on generated TSQA pairs in both open-book (with context) and closed-book settings.

Benchmarks:

TDBench-Wikipedia [New]: TSQA on general knowledge (countries, athletes, etc.)
TDBench-Kaggle [New]: TSQA on domain-specific data (legal, environmental, Netflix)

Metrics:

Answer Accuracy (A)
Time Accuracy (T)
Answer-Time Accuracy (AT)
Statistical methodology: Not explicitly reported in the paper
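
A minimal sketch of how the three metrics could be computed from per-question grades follows. It assumes that AT counts a response only when both the answer and the time references are correct; the exact definitions in the paper may differ, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Graded:
    answer_ok: bool  # final answer matches the ground truth
    time_ok: bool    # time references in the explanation satisfy the constraints


def accuracies(results):
    """Compute A, T, and AT as fractions of all graded responses."""
    n = len(results)
    return {
        "A": sum(r.answer_ok for r in results) / n,
        "T": sum(r.time_ok for r in results) / n,
        "AT": sum(r.answer_ok and r.time_ok for r in results) / n,
    }


# Four hypothetical graded responses covering every combination.
results = [
    Graded(answer_ok=True, time_ok=True),
    Graded(answer_ok=True, time_ok=False),   # right answer, hallucinated date
    Graded(answer_ok=False, time_ok=True),
    Graded(answer_ok=False, time_ok=False),
]
print(accuracies(results))  # {'A': 0.5, 'T': 0.5, 'AT': 0.25}
```

Under this assumption AT can never exceed A, and the gap between the two is the share of responses that are right for the wrong temporal reason.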

Key Results

Large gap between Answer Accuracy and Answer-Time Accuracy indicates high hallucination rates in explanations.
Benchmark	Metric	Baseline	This Paper	Δ
TDBench-Wikipedia	Answer Accuracy (A)	0.784	0.784	0.000
TDBench-Wikipedia	Answer-Time Accuracy (AT)	0.612	0.612	0.000
TDBench-Wikipedia	Answer Accuracy (A)	0.485	0.485	0.000
TDBench-Wikipedia	Answer-Time Accuracy (AT)	0.301	0.301	0.000

Models struggle with specific complex temporal relations.
Benchmark	Metric	Baseline	This Paper	Δ
TDBench (Aggregated)	Answer-Time Accuracy (AT)	0.85	0.45	-0.40

Experiment Figures

Radar chart showing LLM performance across 13 distinct temporal relations (e.g., before, meet, overlap).

Main Takeaways

LLMs often provide correct answers for the wrong reasons, with an average 21.7% drop when requiring correct time references in explanations
Performance varies significantly by temporal relation type; models struggle most with 'overlap' and 'meet' relations while excelling at 'equal'
Multi-hop questions reveal model-specific failure points; some models hallucinate early in the reasoning chain, others later
TDBench enables evaluation on custom/private data (e.g., corporate databases) unlike benchmarks fixed to public Wikipedia dumps

📚 Prerequisite Knowledge

Prerequisites

Understanding of relational database schemas (tables, attributes, rows)
Basic SQL syntax (SELECT, WHERE)
Familiarity with temporal reasoning concepts (Allen's interval algebra)

Key Terms

TSQA: Time-Sensitive Question Answering—QA tasks where the answer depends on when the question is asked or the specific time period mentioned

TFD: Temporal Functional Dependency—a database constraint where specific attributes (e.g., Country, Role) uniquely determine another attribute (e.g., Name) at any given time point

Temporal Join: A database operation that combines rows from two tables based on overlapping time intervals, useful for generating multi-hop questions
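
A temporal join can be sketched in a few lines of plain Python. The function name, the dictionary-based row format, and the data are hypothetical; intervals are assumed to be half-open, `[start, end)`.

```python
def temporal_join(left, right, key):
    """Join rows that share `key` and whose [start, end) intervals intersect.

    Each output row carries the attributes of both inputs and the
    intersection of the two intervals.
    """
    joined = []
    for lrow in left:
        for rrow in right:
            if lrow[key] != rrow[key]:
                continue
            start = max(lrow["start"], rrow["start"])
            end = min(lrow["end"], rrow["end"])
            if start < end:  # non-empty intersection
                joined.append({**lrow, **rrow, "start": start, "end": end})
    return joined


# Hypothetical tables: who led a country, and when it hosted an event.
presidents = [
    {"country": "Freedonia", "president": "A. Alpha", "start": 2000, "end": 2008},
    {"country": "Freedonia", "president": "B. Beta", "start": 2008, "end": 2016},
]
hosts = [
    {"country": "Freedonia", "event": "Expo", "start": 2007, "end": 2009},
]

for row in temporal_join(presidents, hosts, "country"):
    print(row["president"], row["event"], row["start"], row["end"])
# A. Alpha Expo 2007 2008
# B. Beta Expo 2008 2009
```

Each joined row supports a two-hop question such as "Who was president of the country hosting the Expo in 2007?", where the model must first resolve the host and then the office holder for the same period.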

Time Accuracy: A metric measuring whether the specific dates mentioned in an LLM's explanation are factually correct according to the temporal constraints

TDBench: The proposed benchmark system that uses temporal databases to generate and evaluate TSQA pairs

Allen's Interval Algebra: A calculus for temporal reasoning defining 13 possible relations between time intervals (e.g., before, meets, overlaps)
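
The 13 relations can be enumerated with a small classifier. This is a sketch that uses the standard relation names from Allen's algebra (the summary's 'meet' and 'overlap' correspond to "meets" and "overlaps"); intervals are `(start, end)` pairs with `start < end`, and the function name is hypothetical.

```python
def allen_relation(a, b):
    """Return the Allen relation that interval a bears to interval b."""
    (s1, e1), (s2, e2) = a, b
    if s1 >= e1 or s2 >= e2:
        raise ValueError("intervals must satisfy start < end")

    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    if s1 == s2 and e1 == e2:
        return "equal"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    # Remaining case: a proper partial overlap in one direction or the other.
    return "overlaps" if s1 < s2 else "overlapped-by"


reference = (3, 6)
for candidate in [(1, 2), (1, 3), (1, 4), (4, 5), (3, 6)]:
    print(candidate, allen_relation(candidate, reference))
# (1, 2) before
# (1, 3) meets
# (1, 4) overlaps
# (4, 5) during
# (3, 6) equal
```

Exactly one relation holds for any pair of valid intervals, which is what makes the set exhaustive as a basis for question generation.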

Temporal SQL: SQL queries extended with operators to handle time intervals (e.g., OVERLAPS, CONTAINS)

Open-book: Evaluation setting where the model is provided with external context (database rows) to answer the question

Closed-book: Evaluation setting where the model must rely solely on its internal pre-trained knowledge