DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs

📝 Paper Summary

Dynamic Benchmarking, Knowledge Editing, Factuality Evaluation

DyKnow evaluates LLMs against Wikidata to detect outdated knowledge and assess the effectiveness of knowledge editing methods on real-world time-sensitive facts.

Core Problem

Static benchmarks for evaluating LLM factuality quickly become obsolete and cannot detect outdated knowledge, while current knowledge editing studies rely heavily on synthetic datasets rather than real-world dynamic facts.

Why it matters:

Reliable knowledge repositories must maintain accurate, up-to-date information, but LLMs often output outdated facts based on their specific training snapshots
Static benchmarks suffer from data contamination (leakage into future training data) and fail to capture the dynamic nature of real-world facts
Existing editing research focuses on synthetic counterfacts, leaving a gap in understanding how editing methods perform on actual outdated knowledge in diverse domains

Concrete Example: When asked 'Who is Cristiano Ronaldo's current football club?', an LLM trained in 2020 might confidently answer 'Juventus FC', which is outdated. DyKnow uses Wikidata's timeline to identify this as a previously correct but now outdated answer (valid 2018-2021), whereas a static benchmark might just label it 'incorrect' or fail to update the ground truth to 'Al-Nassr'.

Key Novelty

Dynamic Knowledge Benchmarking via Knowledge Graphs

Uses Wikidata to retrieve the *current* attribute value at the exact time of evaluation, rather than relying on a static gold-standard dataset
Retrieves a history of previously correct values with valid time intervals to distinguish between 'outdated' answers (correct in the past) and 'irrelevant' answers (hallucinations)
Leverages the validity intervals of outdated answers to reverse-engineer and estimate the effective temporal cutoff of an LLM's pre-training data
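The timeline idea above can be sketched as a small data structure. This is a minimal illustration, not the paper's code: the names `TimedValue` and `value_at` are hypothetical, the Juventus interval follows the Cristiano Ronaldo example in this summary, and the other dates are approximate values added for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TimedValue:
    """One attribute value plus the interval during which it was valid."""
    value: str
    start: date
    end: Optional[date]  # None means the value is still valid

    def covers(self, day: date) -> bool:
        # The end date is treated as exclusive.
        return self.start <= day and (self.end is None or day < self.end)


def value_at(timeline: list[TimedValue], day: date) -> Optional[str]:
    """Return the value valid on `day`, or None if the timeline has a gap."""
    for entry in timeline:
        if entry.covers(day):
            return entry.value
    return None


# Illustrative timeline; dates are approximate and only for demonstration.
RONALDO_CLUBS = [
    TimedValue("Juventus FC", date(2018, 7, 10), date(2021, 8, 31)),
    TimedValue("Manchester United F.C.", date(2021, 8, 31), date(2022, 11, 22)),
    TimedValue("Al-Nassr", date(2023, 1, 1), None),
]
```

Calling `value_at(RONALDO_CLUBS, date.today())` gives the ground truth at evaluation time, while the closed intervals are what allow an older answer to be labelled as outdated rather than simply wrong.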

Architecture

Conceptual workflow of DyKnow. It shows how a prompt (e.g., about Cristiano Ronaldo's team) is evaluated against a dynamic timeline retrieved from Wikidata.

Evaluation Highlights

GPT-4 (2023) achieves 80% accuracy on time-sensitive facts and Llama-3-8B-Instruct (2024) reaches 76%, yet even the best models still give outdated or irrelevant answers for roughly 15-20% of questions
Outdatedness is severe and not limited to older models: GPT-2 (2019) gives outdated answers to 42% of questions, and the more recent OpenELM-1.1B still gives outdated answers to 47%
Knowledge editing methods struggle with real-world data: ROME fails to scale, while MEMIT improves GPT-J accuracy but shows inconsistent results across different models like Llama-2-Chat

Breakthrough Assessment

7/10

Offers a practical, automated methodology for dynamic evaluation and training data dating. Highlights significant limitations in current knowledge editing methods on real data, though the scope of editing experiments is limited by compute.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of LLM outputs o against a dynamic set of facts F_t retrieved at evaluation time t.

Inputs: Natural language prompt p derived from a factual triplet (subject, property, attribute)

Outputs: Generated answer string, classified as Correct (matches current value), Outdated (matches past value), or Irrelevant (no match)

Pipeline Flow

Fact Selection (Entities & Properties)
Dynamic Retrieval (Wikidata)
Prompt Generation & Validation
Model Querying
Response Classification & Analysis

System Modules

Fact Retriever

Fetch subject entities (top GDP countries, top athletes) and retrieve current/historical attributes with validity intervals from Wikidata

Model or implementation: Wikidata API
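As a hedged sketch of what such a retrieval could look like, the function below builds a SPARQL query for the public Wikidata Query Service. The function name `build_timeline_query` is hypothetical and the paper's actual retrieval code may differ. The qualifiers P580 (start time) and P582 (end time) and the endpoint URL are standard Wikidata conventions.

```python
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_timeline_query(subject_qid: str, property_pid: str) -> str:
    """Build a SPARQL query listing every value of a property with its dates."""
    return f"""
SELECT ?valueLabel ?start ?end WHERE {{
  wd:{subject_qid} p:{property_pid} ?statement .
  ?statement ps:{property_pid} ?value .
  OPTIONAL {{ ?statement pq:P580 ?start . }}  # P580 = start time
  OPTIONAL {{ ?statement pq:P582 ?end . }}    # P582 = end time
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
ORDER BY ?start
""".strip()


# Example: Cristiano Ronaldo (Q11571), member of sports team (P54).
query = build_timeline_query("Q11571", "P54")
```

The resulting string can be sent to `WIKIDATA_SPARQL_ENDPOINT` as the `query` parameter of an HTTP GET request with `format=json`. A statement with no end-time qualifier is the candidate for the current value.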

Prompt Generator

Create diverse natural language queries for each fact to test consistency

Model or implementation: GPT-4 (for paraphrasing)

Response Validator

Match model output against the Wikidata value list to classify as Correct, Outdated, or Irrelevant

Model or implementation: String Matching / Heuristics
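A minimal sketch of the three-way classification, assuming normalized substring matching. The normalization rule and the choice to check the current value before past values are assumptions made for illustration; the paper's heuristics may be more elaborate.

```python
from __future__ import annotations

import re
from enum import Enum


class Label(Enum):
    CORRECT = "correct"
    OUTDATED = "outdated"
    IRRELEVANT = "irrelevant"


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation so 'Al-Nassr' matches 'al nassr'."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def classify(answer: str, current: str, past: list[str]) -> Label:
    """Label an answer by whole-word match against current, then past values."""
    padded = f" {normalize(answer)} "
    if f" {normalize(current)} " in padded:
        return Label.CORRECT
    if any(f" {normalize(p)} " in padded for p in past):
        return Label.OUTDATED
    return Label.IRRELEVANT
```

For example, `classify("He plays for Juventus FC.", "Al-Nassr", ["Juventus"])` yields `Label.OUTDATED`. Because the current value is checked first, an answer that mentions both the current and a past value counts as correct in this sketch.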

Novel Architectural Elements

Temporal Alignment Mechanism: Automatically correlates the validity intervals of 'outdated' answers to infer the latent pre-training data window of black-box LLMs
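One simple way to turn validity intervals into a cutoff estimate is a per-year vote, sketched below. This voting heuristic and the name `estimate_cutoff_year` are assumptions for illustration; the paper inspects the distribution of intervals (reported as violin plots) rather than prescribing this exact rule.

```python
from __future__ import annotations

from collections import Counter


def estimate_cutoff_year(intervals: list[tuple[int, int]]) -> int:
    """Return the latest year covered by the largest number of intervals.

    Each interval is (start_year, end_year), inclusive, taken from the
    Wikidata validity interval of one answer the model gave.
    """
    if not intervals:
        raise ValueError("need at least one validity interval")
    votes = Counter()
    for start, end in intervals:
        for year in range(start, end + 1):
            votes[year] += 1
    best = max(votes.values())
    # Ties are broken toward the most recent year.
    return max(year for year, count in votes.items() if count == best)


# Four answers whose validity intervals all overlap in 2020.
print(estimate_cutoff_year([(2018, 2021), (2019, 2022), (2017, 2020), (2020, 2023)]))  # 2020
```

The intuition is that answers a model believes to be current should all have been valid at some point inside its training window, so the year where most intervals overlap is a plausible estimate of when the data was collected.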

Modeling

Base Model: Evaluation covers 24 LLMs including GPT-4, Llama-3, Mistral, OLMo, and OpenELM

Training Method: Knowledge Editing (ROME, MEMIT, SERAC, IKE) applied to subsets of models (GPT-2, GPT-J, Llama-2-Chat, Mistral-Instruct)

Adaptation: Model editing modifies weights (ROME, MEMIT) or uses external memory/context (SERAC, IKE)

Trainable Parameters: Varies by editing method (feed-forward layers for ROME/MEMIT, external classifier for SERAC)
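To make the weight-editing idea concrete, the sketch below applies a plain rank-one update to a small matrix so that a chosen key vector maps to a new value vector. It is a simplification, not ROME itself: ROME additionally weights the key by an estimated key covariance and selects the layer to edit, whereas this sketch assumes an identity covariance. The function names are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' k equals v."""
    residual = [vi - wki for vi, wki in zip(v, matvec(W, k))]
    k_norm_sq = sum(kj * kj for kj in k)
    return [
        [w + r * kj / k_norm_sq for w, kj in zip(row, k)]
        for row, r in zip(W, residual)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]
key = [1.0, 2.0]          # stands in for the representation of the subject
new_value = [3.0, -1.0]   # stands in for the representation of the updated fact
W_edited = rank_one_edit(W, key, new_value)
```

After the edit, `matvec(W_edited, key)` returns the new value, while any input orthogonal to `key` is mapped exactly as before. That locality is what makes this family of updates attractive for changing a single fact.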

Training Data:

130 time-sensitive facts
Subjects: Top 50 countries by GDP, top 30 athletes, top 25 organizations

Compute: SERAC training required 1x NVIDIA A100 (80GB). Inference used 2x NVIDIA RTX 3090 (24.5GB) or 3x A100 for Mixtral.

Comparison to Prior Work

vs. RealTime QA: DyKnow focuses on distinguishing 'outdated' vs 'wrong' using validity intervals, enabling data dating, whereas RealTime QA focuses on new news [not cited in paper]
vs. Static Benchmarks (MMLU, TruthfulQA): DyKnow dynamically pulls ground truth at inference time, preventing obsolescence
vs. Synthetic Editing Datasets (zsRE, CounterFact): DyKnow evaluates editing on *real-world* outdated facts rather than arbitrary counterfactuals (e.g., 'Eiffel Tower is in Rome')

Limitations

Evaluation limited to 130 facts due to manual verification of prompt templates
Editing experiments restricted to smaller/older models (GPT-2, GPT-J, Llama-2-7B) due to compute constraints
Does not evaluate 'adding' or 'deleting' knowledge, only 'updating' existing facts
Reliance on Wikidata completeness (if Wikidata is missing or lags behind on a fact, a model's answer to it may be misclassified)

Reproducibility

Code: https://github.com/seyedmahed/DyKnow

Code and dataset are provided in the repository. Dataset construction logic is clear (Wikidata queries). Editing experiments used the EasyEdit framework with default configurations. Specific prompts are listed in the paper.

📊 Experiments & Results

Evaluation Setup

Zero-shot QA on time-sensitive facts. Models queried with 3 prompt variations per fact.

Benchmarks:

DyKnow (Custom) (Time-sensitive Fact Retrieval) [New]

Metrics:

Accuracy (Correctness)
Outdated Rate
Irrelevant Rate
Prompt Agreement (Consistency)
Editing Success Rate (Efficacy & Paraphrase)
Statistical methodology: Harmonic mean used for editing metrics. No significance tests reported.
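The metrics above can be computed from per-prompt labels as in the following sketch. The label strings, the function names, and the definition of agreement as "all prompt variants of a fact receive the same label" are assumptions for illustration; the paper's exact definitions may differ.

```python
from __future__ import annotations

from collections import Counter
from statistics import harmonic_mean

LABELS = ("correct", "outdated", "irrelevant")


def label_rates(labels: list[str]) -> dict[str, float]:
    """Fraction of answers carrying each label."""
    counts = Counter(labels)
    return {name: counts[name] / len(labels) for name in LABELS}


def prompt_agreement(labels_per_fact: list[list[str]]) -> float:
    """Share of facts whose prompt variants all received the same label."""
    agreeing = sum(1 for labels in labels_per_fact if len(set(labels)) == 1)
    return agreeing / len(labels_per_fact)


def editing_score(efficacy: float, paraphrase: float) -> float:
    """Harmonic mean of efficacy and paraphrase success (both in percent)."""
    return harmonic_mean([efficacy, paraphrase])
```

The harmonic mean penalizes imbalance: an edit with 100% efficacy but only 50% paraphrase success scores about 66.7, not 75, so a method cannot score well by succeeding only on the exact prompt used for the edit.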

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Evaluation of model factuality shows that while newer models improve, outdatedness remains a major issue.
DyKnow	Correctness	42%	80%	+38 pp
DyKnow	Outdated Rate	42%	14%	-28 pp
DyKnow	Correctness	51%	76%	+25 pp
Knowledge editing experiments reveal mixed performance on real-world data compared to synthetic benchmarks.
DyKnow (Real-world edits)	Harmonic Mean (Success)	34.7	99.0	+64.3
DyKnow (Real-world edits)	Harmonic Mean (Success)	11.1	39.9	+28.8
DyKnow (Real-world edits)	Harmonic Mean (Success)	26.9	91.8	+64.9

Experiment Figures

Temporal distribution of validity intervals for model responses (Violin plots).

Scalability of editing methods (ROME, MEMIT, SERAC) as the number of edits increases.

Main Takeaways

Outdatedness is a persistent problem: Even state-of-the-art models like GPT-4 and Llama-3 output outdated information for ~15% of time-sensitive questions.
Input sensitivity is high: LLMs often give inconsistent answers (one correct, one outdated) to slightly rephrased prompts about the same fact.
Data dating works: Analyzing the validity intervals of 'outdated' answers successfully approximates the known training data cutoffs of models (e.g., GPT-4's cutoff aligns with 2023).
Editing methods are fragile: Algorithms like ROME and MEMIT, which succeed on synthetic data, often fail or show poor scalability on real-world updates, especially with newer instruction-tuned models like Mistral.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and their training lifecycle
Familiarity with Knowledge Graphs (Wikidata) structure (triplets, qualifiers)
Knowledge Editing concepts (ROME, MEMIT, SERAC, IKE)

Key Terms

Wikidata: A collaborative, multilingual knowledge graph where facts are stored as triplets with qualifiers like start/end dates

Temporal Validity Interval: The specific time range (start date to end date) during which a factual attribute value was considered correct

Knowledge Editing: Techniques to update or alter specific facts within an LLM without re-training the entire model

ROME: Rank-One Model Editing, a method that edits a single fact in an LLM by applying a rank-one update to the weights of one feed-forward layer, obtained by solving a least-squares problem under a linear equality constraint

MEMIT: Mass-Editing Memory in a Transformer—an extension of ROME that updates multiple facts simultaneously across multiple layers

SERAC: Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model—uses an external memory of edits and a classifier to route queries to a scope-specific model

IKE: In-Context Knowledge Editing—uses in-context learning by prompting the model with the updated fact and demonstration examples rather than modifying weights

Prompt Agreement: A metric measuring the consistency of an LLM's answers across semantically equivalent but lexically different prompts