WikiFactDiff: A Large, Realistic, and Temporally Adaptable Dataset for Atomic Factual Knowledge Update in Causal Language Models

📝 Paper Summary

Knowledge update in LLMs, Factual consistency evaluation

WikiFactDiff is a large-scale dataset capturing real-world factual changes between two Wikidata snapshots to evaluate how well language models handle realistic knowledge updates like entity insertion and archival.

Core Problem

Existing knowledge update datasets (like CounterFact and zsRE) rely on unrealistic, randomly generated changes and only cover fact replacement, ignoring common real-world scenarios like new entity insertion or archival.

Why it matters:

LLMs are static once trained, so their knowledge goes out of date over time and must be updated to remain reliable in domains like healthcare or politics
Current benchmarks test simple replacements (e.g., changing a capital city randomly), which doesn't reflect how knowledge actually evolves (e.g., a politician leaving office)
Unrealistic updates in current datasets fail to test whether algorithms maintain global coherence or handle the emergence of completely new entities

Concrete Example: Current datasets might test updating 'Albert Einstein's field' from 'Physics' to 'Biology', which is unrealistic. WikiFactDiff captures real changes, such as 'Cristiano Ronaldo' moving from 'Juventus F.C.' (obsolete) to 'Al-Nassr' (new), or 'ChatGPT' emerging as a new entity entirely.

Key Novelty

WikiFactDiff: Realistic Temporal Difference Dataset

Constructs updates by computing the difference between two Wikidata snapshots (Jan 2021 and Feb 2023) to capture actual historical changes rather than synthetic ones
Categorizes updates into five distinct scenarios rather than replacement alone: Archive, AddObject, AddRelation, AddEntity, and ReplaceObject
Introduces a 'temporal adaptability' pipeline that allows the dataset to be regenerated for any two dates to align with different model training cutoffs

Architecture

The WikiFactDiff dataset creation pipeline, illustrating how two Wikidata snapshots are processed to generate the final dataset.

Evaluation Highlights

ROME achieves highest Efficacy-Success (99.7%) and Generalization-Success (98.0%) on the replacement subset, outperforming MEMIT and FT
Fine-tuning (FT) generalizes well (99.5% success) but suffers from massive bleedover (specificity failure), degrading accuracy on neighboring facts
PROMPT (contextual prompting) is competitive with parameter-update methods (98.9% efficacy) but shows higher bleedover on random neighbors than ROME

Breakthrough Assessment

7/10

Significantly improves the realism of knowledge editing benchmarks by moving beyond synthetic replacements to real-world scenarios like entity insertion. However, it is primarily a dataset contribution rather than a new modeling technique.

⚙️ Technical Details

Problem Definition

Setting: Atomic Factual Knowledge Update: Modifying an LLM to reflect a change in a single fact (subject, relation, object)

Inputs: An update request consisting of a subject s, relation r, and target object o* (and optionally the obsolete object o)

Outputs: An updated language model P* that predicts o* for queries about (s,r) while maintaining performance on unrelated facts

Pipeline Flow

Wikidata Snapshots (Old & New) → Preprocessing & Diff → Triple Classification → Neighbor Search → Verbalization

System Modules

Preprocessing & Difference (Dataset Creation)

Filter raw Wikidata dumps and identify changed triples between two dates

Model or implementation: Rule-based scripts
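
The difference step can be pictured as set arithmetic over triples. The following is a minimal sketch, not the authors' scripts: the `Triple` type, the `diff_snapshots` function, the relation labels, and the toy snapshots are all hypothetical, and the real pipeline also filters the raw dumps before comparing them.

```python
from __future__ import annotations

from typing import NamedTuple


class Triple(NamedTuple):
    """A (subject, relation, object) fact; a hypothetical stand-in for a Wikidata statement."""
    subject: str
    relation: str
    obj: str


def diff_snapshots(old: set[Triple], new: set[Triple]) -> dict[str, set[Triple]]:
    """Split triples into those removed, added, and unchanged between two snapshots."""
    return {
        "removed": old - new,
        "added": new - old,
        "unchanged": old & new,
    }


# Toy snapshots (illustrative data only, not taken from the dataset files).
old_snapshot = {
    Triple("Cristiano Ronaldo", "member of sports team", "Juventus F.C."),
    Triple("France", "capital", "Paris"),
}
new_snapshot = {
    Triple("Cristiano Ronaldo", "member of sports team", "Al-Nassr"),
    Triple("France", "capital", "Paris"),
    Triple("ChatGPT", "developer", "OpenAI"),
}

changes = diff_snapshots(old_snapshot, new_snapshot)
```

Here `changes["removed"]` holds the Juventus F.C. triple, `changes["added"]` holds the Al-Nassr and ChatGPT triples, and the Paris triple is unchanged.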

Triple Classification (Dataset Creation)

Label each triple as 'new', 'obsolete', or 'static' to define the update scenario

Model or implementation: Hand-crafted logical rules
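
One way to see how triple labels could determine a scenario is sketched below. The rules are an illustrative guess inferred from the scenario names, not the paper's actual hand-crafted logic, and `classify_scenario` and its `subject_is_new` flag are hypothetical.

```python
from __future__ import annotations

VALID_LABELS = {"new", "obsolete", "static"}


def classify_scenario(labels: set[str], subject_is_new: bool) -> str | None:
    """Map the labels of one (subject, relation) group to an update scenario.

    Assumed rules (not the paper's exact logic):
      - a new fact about a subject absent from the old snapshot -> AddEntity
      - both new and obsolete objects                           -> ReplaceObject
      - obsolete objects and no new ones                        -> Archive
      - new objects alongside static ones                       -> AddObject
      - only new objects for an existing subject                -> AddRelation
      - only static objects                                     -> no update
    """
    if not labels <= VALID_LABELS:
        raise ValueError(f"unexpected labels: {labels - VALID_LABELS}")
    if "new" in labels and subject_is_new:
        return "AddEntity"
    if "new" in labels and "obsolete" in labels:
        return "ReplaceObject"
    if "obsolete" in labels:
        return "Archive"
    if labels == {"new", "static"}:
        return "AddObject"
    if labels == {"new"}:
        return "AddRelation"
    return None


# Ronaldo's team: one obsolete object (Juventus F.C.) and one new object (Al-Nassr).
ronaldo_scenario = classify_scenario({"new", "obsolete"}, subject_is_new=False)
# ChatGPT: the subject itself did not exist in the old snapshot.
chatgpt_scenario = classify_scenario({"new"}, subject_is_new=True)
```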

Verbalization (Dataset Creation)

Convert structured triples into natural language sentences for LLM input/evaluation

Model or implementation: ChatGPT (GPT-3.5) for template generation
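
Template-based verbalization can be sketched as string substitution. The templates below are hand-written and hypothetical; in the dataset they are generated with ChatGPT, and the `TEMPLATES` mapping and `verbalize` function are names invented for this example. Each template ends where the object should appear, so the model's continuation can be scored.

```python
from __future__ import annotations

# Hypothetical templates per relation; the released dataset ships its own.
TEMPLATES: dict[str, list[str]] = {
    "member of sports team": [
        "{subject} plays for",
        "The team that {subject} belongs to is",
    ],
}


def verbalize(subject: str, relation: str) -> list[str]:
    """Turn a (subject, relation) pair into cloze-style prompts ending before the object."""
    return [template.format(subject=subject) for template in TEMPLATES[relation]]


prompts = verbalize("Cristiano Ronaldo", "member of sports team")
```

Having several templates per relation gives the rephrased prompts that a generalization metric needs.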

Novel Architectural Elements

Temporal difference pipeline: A systematic method to generate datasets from ANY two knowledge base snapshots, ensuring temporal adaptability
Multi-scenario taxonomy: Formalization of update types beyond replacement (Archive, AddObject, AddRelation, AddEntity)

Modeling

Base Model: GPT-J (6 Billion parameters)

Training Method: Various Atomic Update Algorithms (ROME, MEMIT, MEND, FT)

Objective Functions:

Purpose: Maximize probability of new fact.

Formally: Maximize P(o* | s, r)

Purpose: Minimize change to other facts (Specificity).

Formally: Minimize KL divergence between pre- and post-update distributions on neighboring inputs
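
The two objectives can be written as a single loss. This is one common formulation, given as a sketch. The weight \lambda, the neighbor set \mathcal{N}, the parameter symbols \theta (before the update) and \theta^* (after), and the direction of the KL term are assumptions not stated in the summary above.

```latex
\mathcal{L}(\theta^*) =
  -\log P_{\theta^*}(o^* \mid s, r)
  \;+\; \lambda \sum_{(s', r') \in \mathcal{N}}
  D_{\mathrm{KL}}\!\left(
    P_{\theta}(\cdot \mid s', r') \,\middle\|\, P_{\theta^*}(\cdot \mid s', r')
  \right)
```

The first term rewards assigning high probability to the new object o*. The second penalizes drift in the predictions for neighboring inputs (s', r').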

Adaptation: Atomic updates (editing specific weights) or Fine-tuning

Key Hyperparameters:

precision: bfloat16
device: RTX 3090 (24GB VRAM)

Compute: About 5.2 seconds per update for ROME and MEMIT on GPT-J with an RTX 3090; FT takes ~0.6 s.

Comparison to Prior Work

vs. CounterFact: CounterFact uses random/synthetic changes (e.g., Einstein -> Biology); WikiFactDiff uses real historical changes (e.g., Ronaldo -> Al-Nassr)
vs. zsRE: zsRE lacks temporal metadata and diverse update types; WikiFactDiff includes archival and new entity scenarios
vs. Recent editing papers: Most focus only on replacement; WikiFactDiff provides data for insertion and deletion (archival) [not cited in paper]

Limitations

Evaluation experiments are limited to the 'Replacement' scenario; new scenarios (Archival, AddEntity) are provided but not benchmarked with new algorithms.
Reliance on Wikipedia popularity for filtering might bias the dataset toward Western-centric or highly popular entities.
Verbalization templates generated by ChatGPT might have limited linguistic diversity compared to human-authored text.
The 'Neighbor' search for specificity relies on TF-IDF similarity of Wikipedia pages, which is a proxy and might miss semantic relations not captured by lexical overlap.
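
To make the lexical-overlap limitation concrete, here is a minimal TF-IDF nearest-neighbor sketch in pure Python. It is not the authors' implementation: the whitespace tokenizer, the unsmoothed IDF, the function names, and the toy page texts are all assumptions.

```python
from __future__ import annotations

import math
from collections import Counter


def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Build sparse TF-IDF vectors using whitespace tokens and idf = log(N / df)."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(target: str, docs: dict[str, str], k: int) -> list[str]:
    """Return the k entities whose page text is most similar to the target's."""
    vectors = tfidf_vectors(docs)
    scores = {
        name: cosine(vectors[target], vec)
        for name, vec in vectors.items()
        if name != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy page texts (illustrative only).
pages = {
    "Cristiano Ronaldo": "portuguese football forward striker goals club",
    "Lionel Messi": "argentine football forward goals club",
    "Albert Einstein": "german physicist relativity theory physics",
}
neighbors = nearest_neighbors("Cristiano Ronaldo", pages, k=1)
```

Two pages that share no vocabulary score exactly zero here, however related the entities are. That is the blind spot described in the limitation above.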

Reproducibility

Code: https://github.com/Orange-OpenSource/WikiFactDiff

Dataset, generation code, and evaluation scripts are publicly available on GitHub and HuggingFace. The study uses open-source GPT-J-6B. Verbalization relies on OpenAI's ChatGPT for template generation (a closed component), but the templates themselves are released.

📊 Experiments & Results

Evaluation Setup

Atomic update of single facts in GPT-J using the replacement subset of WikiFactDiff (WFD_repl)

Benchmarks:

WikiFactDiff, replacement subset (task: fact replacement, i.e. atomic knowledge update) [New]

Metrics:

Efficacy (Success/Difference): Ability to recall the new fact
Generalization (Success/Difference): Ability to recall the new fact under rephrased prompts
Specificity (Bleedover): Impact on neighboring/unrelated facts
Fluency: Quality of generated text after update
Statistical methodology: 95% confidence intervals reported for all metrics
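
The Success/Difference pairing can be sketched as follows, assuming the CounterFact convention: Success is the fraction of prompts where the new object is more probable than the obsolete one, and Difference is the mean probability gap. Whether WikiFactDiff computes them exactly this way is an assumption. The function names and toy probabilities are hypothetical, and the normal-approximation interval is only one possible way to obtain a 95% confidence interval.

```python
from __future__ import annotations

import math
from statistics import mean, stdev


def success_and_difference(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    """Compute Success and Difference from (P(new object), P(obsolete object)) pairs."""
    success = mean(1.0 if p_new > p_old else 0.0 for p_new, p_old in pairs)
    difference = mean(p_new - p_old for p_new, p_old in pairs)
    return success, difference


def confidence_interval_95(values: list[float]) -> tuple[float, float]:
    """Normal-approximation 95% interval around the mean (needs at least 2 values)."""
    center = mean(values)
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    return center - half_width, center + half_width


# Toy post-update probabilities for four prompts (illustrative only).
probs = [(0.9, 0.05), (0.6, 0.2), (0.1, 0.3), (0.8, 0.1)]
success, difference = success_and_difference(probs)
low, high = confidence_interval_95([p_new - p_old for p_new, p_old in probs])
```

Running the same functions on rephrased prompts would give the generalization counterparts of both numbers.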

Key Results

Comparison of different update algorithms on the WikiFactDiff replacement subset, measuring how well they implant new knowledge (Efficacy/Generalization) and avoid damaging existing knowledge (Bleedover).
Benchmark	Metric	Baseline	This Paper	Δ
WFD_repl	Efficacy Success (ES)	44.6	99.7	+55.1
WFD_repl	Generalization Success (GS)	44.4	98.0	+53.6
WFD_repl	Bleedover (K-nearest neighbors)	0.0	5.2	+5.2
WFD_repl	Efficacy Success (ES)	44.6	99.6	+55.0
WFD_repl	Generalization Success (GS)	44.4	53.6	+9.2
WFD_repl	Efficacy Success (ES)	44.6	98.9	+54.3

Experiment Figures

Impact of neighbor popularity and similarity on bleedover metrics.

Main Takeaways

Realistic updates are as challenging as synthetic ones: Algorithms show similar ranking patterns on WikiFactDiff as on CounterFact (ROME > MEMIT > MEND > FT).
Fine-tuning (FT) is extremely effective at memorizing the specific update sentence but fails to maintain specificity (high bleedover) or fluency.
Adding parameter constraints to fine-tuning (FT+L) fixes bleedover but destroys the ability to generalize to rephrased prompts.
Bleedover is significantly higher on semantically related neighbors (K-nearest) than random neighbors, confirming the importance of using K-nearest neighbors for specificity evaluation.
Subject popularity and similarity positively correlate with bleedover probability; updating popular entities is riskier for side effects.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graphs (Triples: Subject, Relation, Object)
Language Model editing/updating techniques
Evaluation metrics for knowledge editing (Efficacy, Specificity, Generalization)

Key Terms

atomic update: Inserting, replacing, or removing a single simple fact within a model without retraining the whole model

bleedover: A negative side effect where updating a specific fact unintentionally alters the model's knowledge of other unrelated or neighboring facts

cloze test: A sentence with a missing word (blank) that the model must complete, used to evaluate factual knowledge (e.g., 'The capital of France is __')

temporal functional relation: A relation where a subject can only have one valid object at a specific point in time (e.g., 'current head of state' or 'population')

ROME: Rank-One Model Editing—an algorithm that updates specific facts in an LLM by modifying the weights of a specific layer using a rank-one matrix update

MEMIT: Mass-Editing Memory in a Transformer—an extension of ROME designed to perform thousands of updates simultaneously

MEND: Model Editor Networks with Gradient Decomposition—a hypernetwork-based approach that predicts weight updates efficiently

Wikidata: A structured, collaborative knowledge base used as the source of truth for constructing the dataset

TF-IDF: Term Frequency-Inverse Document Frequency—a statistical measure used here to compute similarity between entities for finding neighbors