ThinkEval: Practical Evaluation of Knowledge Preservation and Consistency in LLM Editing with Thought-based Knowledge Graphs

📝 Paper Summary

Model Editing, Knowledge Updating, Factuality

ThinkEval reveals that current model-editing techniques fail to prevent indirect knowledge leakage: 'edited-out' facts can still be reconstructed via multi-step reasoning chains.

Core Problem

Existing model-editing techniques and evaluations focus on isolated facts or simple multi-hop queries, failing to detect when 'deleted' knowledge persists through indirect causal links and reasoning chains.

Why it matters:

Privacy breaches: Sensitive information supposedly removed can be recovered through logical deduction
Misinformation persistence: Outdated or harmful knowledge remains accessible via indirect queries, undermining safety updates
Reliability: Models become inconsistent when direct queries yield updated facts but reasoning chains yield the old, incorrect facts

Concrete Example: If a model is edited to change Harry Potter's school to Ilvermorny but retains the links (Harry Potter → Gryffindor) and (Gryffindor → Hogwarts), a user can still infer that he studied at Hogwarts, bypassing the edit.
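The leakage in this example can be sketched as a reachability check on a toy graph. The graph contents, the relation names, and the `reachable` helper are illustrative assumptions, not the paper's code or data.

```python
from collections import deque

# Toy knowledge graph after the edit (relation names are hypothetical).
# The direct "school" edge now points to Ilvermorny, but the house links remain.
edited_graph = {
    "Harry Potter": [("school", "Ilvermorny"), ("house", "Gryffindor")],
    "Gryffindor": [("belongs_to", "Hogwarts")],
}

def reachable(graph, start, target, max_hops=5):
    """Return the first chain of facts from start to target within max_hops, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target and path:
            return path
        if len(path) == max_hops:
            continue
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None

# The direct edit holds, yet the original school is still reachable indirectly.
leak = reachable(edited_graph, "Harry Potter", "Hogwarts")
```

Here `leak` holds the two-step chain through Gryffindor, which is the path an editor would also need to address for the edit to count as deep.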

Key Novelty

Deep Editing Evaluation Framework

Introduces 'deep editing' as a stricter evaluation standard where an edited fact must not be deducible via any multi-step reasoning path
Uses Chain-of-Thought reasoning to reverse-engineer model-specific knowledge graphs, identifying implicit logical paths that current editors miss
Proposes a new metric, Indirect Fact Recovery (IFR), to quantify how easily original facts can be reconstructed through sequential prompting

Architecture

Figure: The ThinkEval workflow for generating evaluation datasets, shown as a cyclical process of extracting knowledge from an LLM.

Evaluation Highlights

AlphaEdit fails to suppress indirect leakage in >80% of samples, despite success on direct queries
ROME and MEMIT show high Indirect Fact Recovery scores (0.35–0.60 range), indicating significant retention of 'edited' knowledge
Trade-off identified: Techniques that suppress indirect leakage (like RECT) often cause catastrophic damage to broader contextual knowledge (low Preservation scores)

Breakthrough Assessment

8/10

Exposes a fundamental flaw in current model editing: 'successful' edits are often superficial. The framework and dataset provide a necessary, more rigorous standard for future safety-critical editing.

⚙️ Technical Details

Problem Definition

Setting: Model editing in which an original fact t=(s,r,o) is replaced by an edit, and the goal is to ensure that t is no longer deducible from the edited model's knowledge graph G' via any reasoning path.

Inputs: An LLM to be edited, a target edit (subject, relation, new_object), and a set of logical implication chains.

Outputs: Evaluation metrics quantifying the recoverability of the original fact (IFR) and the integrity of unrelated knowledge (Preservation).

Pipeline Flow

Query Validation & Refinement: Generate and validate queries for triplets/chains
Automated Triplet Generation: Use CoT to extract model-internal knowledge into triplets
Graph Synthesis: Construct knowledge graph and logical chains
Dataset Compilation: Create sequential prompting sequences
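The four stages can be read as a loop in which validated triplets supply the next round of queries. The sketch below is a minimal illustration under stated assumptions: `FAKE_MODEL_KNOWLEDGE` is a lookup table standing in for the target LLM, and every function name and the prompt template are hypothetical.

```python
# Hypothetical stand-in for the target LLM: maps an entity to the facts it "states"
# in a chain-of-thought answer. A real pipeline would query the model instead.
FAKE_MODEL_KNOWLEDGE = {
    "Harry Potter": [("Harry Potter", "house", "Gryffindor")],
    "Gryffindor": [("Gryffindor", "belongs_to", "Hogwarts")],
    "Hogwarts": [],
}

def extract_triplets(entity):
    """Automated Triplet Generation (sketch): facts the model gives for an entity."""
    return FAKE_MODEL_KNOWLEDGE.get(entity, [])

def is_validated(triplet):
    """Query Validation (sketch): keep a triplet only if the model confirms it."""
    subject = triplet[0]
    return triplet in FAKE_MODEL_KNOWLEDGE.get(subject, [])

def build_graph(seed_entity, max_depth=5):
    """Graph Synthesis (sketch): feed validated objects back in as new queries."""
    graph, frontier, visited = [], [seed_entity], {seed_entity}
    for _ in range(max_depth):
        next_frontier = []
        for entity in frontier:
            for triplet in extract_triplets(entity):
                if is_validated(triplet) and triplet not in graph:
                    graph.append(triplet)
                    obj = triplet[2]
                    if obj not in visited:
                        visited.add(obj)
                        next_frontier.append(obj)
        if not next_frontier:
            break
        frontier = next_frontier
    return graph

def compile_prompts(graph):
    """Dataset Compilation (sketch): one prompt per triplet, in discovery order."""
    return [f"Complete the fact: {s} | {r} | ?" for s, r, _ in graph]

graph = build_graph("Harry Potter")
prompts = compile_prompts(graph)
```

The depth cap of 5 mirrors the maximum chain length mentioned in this summary; the loop stops early once no new entities are discovered.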

System Modules

Query Validation (Data Construction)

Verify if the LLM recognizes a specific relationship or chain before adding it to the graph

Model or implementation: Target LLM (e.g., Qwen2.5-7B)

Triplet Extractor (Data Construction)

Parse CoT responses into atomic facts (triplets)

Model or implementation: Target LLM with specific prompting
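As a rough illustration of the extraction step, the sketch below parses triplets out of a chain-of-thought response. It assumes the model has been prompted to write each fact as "(subject, relation, object)"; that output convention, the regular expression, and the example response are assumptions, since the summary does not give the actual prompt or format.

```python
import re

# Assumed output convention: one "(subject, relation, object)" tuple per fact.
TRIPLET_PATTERN = re.compile(
    r"\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)"
)

def parse_triplets(cot_response):
    """Extract (subject, relation, object) triplets from a chain-of-thought response."""
    return TRIPLET_PATTERN.findall(cot_response)

# Invented response text, used only to exercise the parser.
example_response = (
    "Step 1: (Harry Potter, house, Gryffindor). "
    "Step 2: (Gryffindor, belongs to, Hogwarts). "
    "Therefore Harry Potter studied at Hogwarts."
)
triplets = parse_triplets(example_response)
```

Free-text sentences such as the final "Therefore ..." line are ignored by this pattern, so only explicitly tupled facts enter the graph.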

Novel Architectural Elements

Cyclical dataset generation pipeline: Validated triplets feed back into query generation to iteratively expand the knowledge graph depth
Model-specific graph construction: Unlike static benchmarks, ThinkEval builds graphs based on what the specific model actually 'knows', ensuring valid testing of internal consistency

Modeling

Base Models: Qwen2.5-7B-Instruct, Meta-Llama-3-8B-Instruct, GPT2-XL (1.5B)

Training Method: Not applicable — this is an evaluation framework applied to existing editing methods

Compute: All experiments performed on an NVIDIA A100 80GB GPU

Comparison to Prior Work

vs. MQuAKE: Tests sequential recovery (step-by-step probing) rather than single-prompt multi-hop reasoning, better mimicking user interrogation
vs. RippleEdits: Constructs deep implication chains (up to 5 steps) to find remote leakage paths
vs. UnKEBench [not cited in paper]: Focuses on structured logical chains rather than unstructured text editing

Limitations

Graph construction requires human oversight to verify extracted triplets
Computational cost of generating model-specific knowledge graphs is high
Evaluation is limited to 5-step chains; deeper leakage might exist

Reproducibility

Code: https://github.com/manitbaser/KnowGIC

The KnowGIC dataset and the code are both available at https://github.com/manitbaser/KnowGIC. The editing methods are implemented with the EasyEdit library.

📊 Experiments & Results

Evaluation Setup

Deep editing evaluation using sequential prompting on the KnowGIC benchmark

Benchmarks:

KnowGIC (Sequential multi-step reasoning / Indirect fact recovery) [New]

Metrics:

Indirect Fact Recovery (IFR)
Preservation
Efficacy
Statistical methodology: Not explicitly reported in the paper
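The exact metric formulas are not reproduced in this summary. As a rough illustration only, the sketch below treats IFR as the fraction of implication chains that recover the original fact, and Preservation as the fraction of unrelated facts still answered correctly after the edit. Both definitions are simplifying assumptions rather than the paper's formulas, and the outcome lists are invented.

```python
def indirect_fact_recovery(chain_recovered):
    """Assumed proxy: share of implication chains that recover the original fact."""
    return sum(chain_recovered) / len(chain_recovered)

def preservation(unrelated_correct):
    """Assumed proxy: share of unrelated facts still answered correctly after the edit."""
    return sum(unrelated_correct) / len(unrelated_correct)

# Hypothetical outcomes for one edit.
chain_recovered = [True, False, False, True]          # True: the chain leaked the original fact
unrelated_correct = [True, True, True, False, True]   # True: the unrelated fact survived

ifr = indirect_fact_recovery(chain_recovered)   # lower is better
pres = preservation(unrelated_correct)          # higher is better
```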

Key Results

Evaluation of parameter-modifying editing techniques on Qwen2.5-7B-Instruct using the KnowGIC benchmark. High IFR (lower is better) indicates failure to suppress indirect leakage. High Preservation (higher is better) indicates safety of unrelated knowledge.
Benchmark	Metric	Baseline	This Paper	Δ
KnowGIC	IFR (Indirect Fact Recovery)	0.48	0.33	-0.15
KnowGIC	IFR (Indirect Fact Recovery)	0.48	0.19	-0.29
KnowGIC	Preservation	1.00	0.98	-0.02
KnowGIC	Preservation	1.00	0.58	-0.42
Evaluation on Llama-3-8B-Instruct showing similar trends.
Benchmark	Metric	Baseline	This Paper	Δ
KnowGIC	IFR (Indirect Fact Recovery)	0.54	0.45	-0.09
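The Δ column is the "This Paper" value minus the "Baseline" value. A quick consistency check over the rows above, with values copied from the table (the rows do not name the editing method, so none is assigned here):

```python
# (metric, baseline, reported value, reported delta), copied from the rows above.
rows = [
    ("IFR", 0.48, 0.33, -0.15),
    ("IFR", 0.48, 0.19, -0.29),
    ("Preservation", 1.00, 0.98, -0.02),
    ("Preservation", 1.00, 0.58, -0.42),
    ("IFR", 0.54, 0.45, -0.09),
]

# Recompute each delta and collect rows whose reported delta disagrees.
mismatches = [
    (metric, baseline, value, delta)
    for metric, baseline, value, delta in rows
    if abs((value - baseline) - delta) > 1e-9
]
```

An empty `mismatches` list means every reported Δ agrees with its two source columns.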

Experiment Figures

Motivating example of indirect leakage. Editing 'Harry Potter's school' to 'Ilvermorny' fails because the model still knows 'Harry -> Gryffindor' and 'Gryffindor -> Hogwarts'.

Main Takeaways

Current editing techniques face a severe trade-off: methods that effectively hide the fact (low IFR, as with RECT) destroy broader knowledge (low Preservation).
Precise methods like ROME and MEMIT fail deep editing tests: they update the specific fact but leave reasoning chains intact, allowing trivial recovery of the old fact.
AlphaEdit offers the best balance but still leaks the original fact in >30% of reasoning paths, making it unsafe for high-stakes privacy or safety edits.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs)
Knowledge of Knowledge Graphs (entities, relations, triplets)
Familiarity with Model Editing techniques (ROME, MEMIT)

Key Terms

deep editing: An evaluation setting where an edit is successful only if the original fact cannot be deduced through any multi-step reasoning chain

Indirect Fact Recovery (IFR): A metric measuring the probability that the original 'edited-out' fact can still be deduced via sequential reasoning queries

sequential prompting: Probing a model with a series of connected questions where the answer to one becomes the input to the next
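A minimal sketch of this probing loop follows. The `answer` function is a hypothetical stand-in for the edited model, backed by a lookup table, and the relation names are invented for illustration.

```python
# Hypothetical edited model: answers (entity, relation) questions from a lookup table.
EDITED_MODEL_ANSWERS = {
    ("Harry Potter", "school"): "Ilvermorny",   # the direct edit holds
    ("Harry Potter", "house"): "Gryffindor",    # an untouched related fact
    ("Gryffindor", "belongs to"): "Hogwarts",   # another untouched related fact
}

def answer(entity, relation):
    """Stand-in for querying the edited LLM; returns None when it has no answer."""
    return EDITED_MODEL_ANSWERS.get((entity, relation))

def probe_sequentially(start_entity, relations):
    """Ask one question per relation, feeding each answer into the next question."""
    entity, transcript = start_entity, []
    for relation in relations:
        reply = answer(entity, relation)
        transcript.append((entity, relation, reply))
        if reply is None:
            return None, transcript   # the chain broke before reaching an answer
        entity = reply
    return entity, transcript

direct, _ = probe_sequentially("Harry Potter", ["school"])
indirect, steps = probe_sequentially("Harry Potter", ["house", "belongs to"])
```

In this toy setup the direct query returns the edited answer while the two-step sequence ends at the original one, which is the inconsistency sequential prompting is meant to surface.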

deductive closure: The set of all facts, explicit and implicit, that can be logically derived from a knowledge graph

implication chain: A sequence of facts (A→B→C) that logically implies a target fact (A→C)
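Enumerating such chains on a small graph can be sketched with a depth-limited search. The graph below is a toy example and `implication_chains` is a hypothetical helper; the default cap of 5 steps mirrors the chain depth mentioned in this summary.

```python
def implication_chains(edges, source, target, max_steps=5):
    """Yield every cycle-free chain of at most max_steps edges from source to target."""
    def extend(node, chain, visited):
        if node == target and chain:
            yield list(chain)
            return
        if len(chain) == max_steps:
            return
        for nxt in edges.get(node, []):
            if nxt not in visited:
                yield from extend(nxt, chain + [(node, nxt)], visited | {nxt})
    yield from extend(source, [], {source})

# Toy graph: A implies C both via B and via the longer route through D and E.
edges = {"A": ["B", "D"], "B": ["C"], "D": ["E"], "E": ["C"]}

chains = list(implication_chains(edges, "A", "C"))
short_only = list(implication_chains(edges, "A", "C", max_steps=2))
```

Lowering `max_steps` drops the longer route, which illustrates why a fixed depth cap can miss more remote leakage paths.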

ripple effects: Unintended changes to unrelated knowledge caused by modifying model parameters

CoT: Chain-of-Thought, a prompting technique in which the model generates intermediate reasoning steps before the final answer