Context Robust Knowledge Editing for Language Models

📝 Paper Summary

Knowledge Editing (KE), Context Robustness, Factuality in LLMs

CoRE improves knowledge editing robustness by minimizing the variance of hidden states across diverse prefix contexts during the editing process, preventing models from reverting to outdated facts when distracted by conversation history.

Core Problem

Existing Knowledge Editing (KE) evaluations test models in isolation, but in real conversations, preceding contexts (especially those with relevant entities) often distract the model, causing it to revert to the original, outdated knowledge.

Why it matters:

Real-world applications like chatbots always involve dialogue history, making context-free evaluation unrealistic and overly optimistic
Standard KE methods fail significantly when a 'distractor' context is present, undermining their reliability for correcting hallucinations or updating facts
Prior benchmarks (CounterFact, MQuAKE) do not systematically evaluate the interference caused by semantically relevant preceding contexts

Concrete Example: If a model is edited to know 'Tim Cook works for Amazon' (instead of Apple), a preceding user question like 'Who's in charge of developing the iPhone?' may trigger the original association with Apple, causing the model to ignore the edit and output 'Apple'.

Key Novelty

Context Robust Editing (CoRE) & CHED Benchmark

Introduces CHED, a dataset of 'hop word' prefix contexts designed to distract the model by including entities semantically related to the original or edited fact
Proposes CoRE, a method that modifies the MEMIT objective by adding a regularization term that forces the model's hidden states to remain consistent (low variance) regardless of the preceding context

Architecture

Overview of the CoRE method. It illustrates the extraction of key-value pairs using relevant prefixes (s, o, o*) and the regularization of value vectors.

Evaluation Highlights

CoRE outperforms MEMIT by +17.2 points in edit success rate (69.1 → 86.3) on the CHED benchmark when facing distractive 'hop word' contexts
Significantly reduces the performance gap between context-free and contextual editing compared to baselines like ROME and MEMIT
Maintains high performance on general capabilities (downstream tasks) and fluency, showing that robust editing does not degrade the model's overall quality

Breakthrough Assessment

8/10

Identifies a critical failure mode in current KE methods (context sensitivity) and provides both a rigorous benchmark and an effective solution. The focus on 'hop words' as distractors is a strong insight.

⚙️ Technical Details

Problem Definition

Setting: Knowledge Editing (KE) where a factual association (s, r, o) is updated to (s, r, o*), evaluated under the presence of distractive prefix contexts

Inputs: An edit request (s, r, o*) and a prompt p with a preceding context x (x + p)

Outputs: The edited object o*
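
As a minimal illustration of this setting, the edit request and the contextual input x + p could be represented as below. The names `EditRequest` and `build_contextual_input` are hypothetical and not taken from the CoRE codebase, writing the relation as a prompt template is an assumption, and so is joining x and p with a single space. The strings reuse the Tim Cook example from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditRequest:
    """Hypothetical container for an edit (s, r, o) -> (s, r, o*)."""
    subject: str      # s
    relation: str     # r, written as a prompt template with a {} slot for s (assumption)
    target_new: str   # o*, the object the model should produce after the edit
    target_old: str   # o, the original object

    def prompt(self) -> str:
        """Render the query prompt p for this request."""
        return self.relation.format(self.subject)


def build_contextual_input(prefix_context: str, prompt: str) -> str:
    """Concatenate a preceding context x and a prompt p into x + p."""
    return f"{prefix_context.strip()} {prompt.strip()}"


request = EditRequest("Tim Cook", "{} works for", "Amazon", "Apple")
x = "Who's in charge of developing the iPhone?"   # distractive prefix context
model_input = build_contextual_input(x, request.prompt())
print(model_input)
```

An edit counts as robust in this setting when the model completes `model_input` with `request.target_new` rather than `request.target_old`.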

Pipeline Flow

Hop Word Collection: Extract entities related to (s,r,o) from Wikidata
Prefix Generation: Create distractive sentences using hop words
Key-Value Computation: Calculate new key-value pairs for the edit
Regularized Optimization (CoRE): Update weights to map keys to values while minimizing hidden state variance across contexts

System Modules

Hop Word Collector

Identify potential distractors

Context Generator

Create realistic distractive prefixes

Model or implementation: GPT-4o-mini

CoRE Editor

Update model weights with context robustness

Novel Architectural Elements

Variance Regularization Term: A new loss component in the weight update objective that penalizes the L2 distance between hidden states generated from different prefix contexts
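
The penalty described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `pairwise_sq_l2` is hypothetical, and summing over unordered pairs (each pair counted once) is an assumption about the exact normalization.

```python
import numpy as np


def pairwise_sq_l2(hidden: np.ndarray) -> float:
    """Sum of squared L2 distances over all unordered pairs of rows.

    hidden has shape (N, d): one hidden state per prefix context.
    """
    diffs = hidden[:, None, :] - hidden[None, :, :]   # (N, N, d) pairwise differences
    sq_dist = np.sum(diffs ** 2, axis=-1)             # (N, N) squared distances
    return float(np.sum(np.triu(sq_dist, k=1)))       # keep each pair (i < j) once


# Two contexts whose hidden states differ by the vector (3, 4).
states = np.array([[0.0, 0.0], [3.0, 4.0]])
penalty = pairwise_sq_l2(states)
```

The penalty is zero exactly when every prefix context yields the same hidden state, which is the behavior the regularizer pushes toward.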

Modeling

Base Model: Llama-3-8B-Instruct (also tested on GPT-J-6B)

Training Method: Locate-then-edit (Direct weight modification)

Objective Functions:

Purpose: Ensure the edited fact is generated.
Formally: Minimize L2 distance between projected key and target value (standard MEMIT term)

Purpose: Preserve unedited knowledge.
Formally: KL divergence penalty or preserving original key-value mappings (standard MEMIT term)

Purpose: Enforce context robustness (CoRE novelty).
Formally: L_prefix = sum ||h_i - h_j||^2 (minimizing pairwise squared L2 distances of hidden states across N prefix contexts)
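
One plausible way to write the three terms as a single objective is given below. It is a sketch based only on the descriptions above, and the exact formulation in the paper may differ. The symbols for the edit and preservation terms are placeholder names for the two standard MEMIT terms, the restriction to pairs i < j is an assumption, h_i denotes the hidden state obtained under the i-th prefix context, and the weight corresponds to lambda_reg in the hyperparameters.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{edit}} \;+\; \mathcal{L}_{\text{preserve}} \;+\; \lambda_{\text{reg}} \, \mathcal{L}_{\text{prefix}},
\qquad
\mathcal{L}_{\text{prefix}} \;=\; \sum_{1 \le i < j \le N} \lVert h_i - h_j \rVert_2^2
```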

Key Hyperparameters:

lambda_reg: 0.1 (regularization strength for L_prefix)
layers_edited: layers 4, 5, 6, 7, 8 (for GPT-J)
n_contexts: N (number of prefix contexts used during edit)

Comparison to Prior Work

vs. MEMIT: CoRE adds explicit regularization for hidden state variance across contexts and uses 'relevant' prefixes (s, o, o*) during the edit update instead of generic ones
vs. ROME: CoRE supports mass editing and context robustness, whereas ROME is single-edit and context-sensitive
vs. MQuAKE/RippleEdits (related benchmarks, not direct baselines in the paper): CHED focuses on 'distractive context' robustness specifically, rather than multi-hop reasoning or logical consistency ripple effects

Limitations

Reliance on Wikidata connectivity for hop words might miss semantic relations not captured in the graph
Coherence of generated prefix contexts is moderate (3.4/5), potentially affecting realism
Focuses primarily on locate-then-edit paradigm; impact on meta-learning or weight-preserving methods is less explored
Computational cost of calculating pairwise distances for regularization scales with the number of context samples
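
On the last point, a standard identity may help: the sum of squared distances over all unordered pairs equals N times the sum of squared deviations from the mean, so the penalty value can be obtained in time linear in the number of contexts. The sketch below checks this numerically. The function names are hypothetical, and whether the authors' implementation uses this form is not stated in the summary.

```python
import numpy as np


def pairwise_penalty_bruteforce(hidden: np.ndarray) -> float:
    """O(N^2 d): explicit loop over all unordered pairs (i < j)."""
    n = hidden.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += float(np.sum((hidden[i] - hidden[j]) ** 2))
    return total


def pairwise_penalty_centered(hidden: np.ndarray) -> float:
    """O(N d): uses sum_{i<j} ||h_i - h_j||^2 = N * sum_i ||h_i - mean||^2."""
    n = hidden.shape[0]
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    return float(n * np.sum(centered ** 2))


rng = np.random.default_rng(0)
states = rng.normal(size=(6, 16))    # toy data: 6 contexts, 16-dim hidden states
brute = pairwise_penalty_bruteforce(states)
fast = pairwise_penalty_centered(states)
```

The identity also makes the link between "pairwise distance" and "variance" explicit: minimizing one minimizes the other. Each context still needs its own forward pass to produce a hidden state, which the identity does not remove.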

Reproducibility

Code: https://github.com/holi-lab/CoRE

The CHED dataset is released in the same repository. CHED is built upon CounterFact, and its prefix contexts were generated using GPT-4o-mini.

📊 Experiments & Results

Evaluation Setup

Knowledge Editing task with preceding distractive contexts

Benchmarks:

CHED (Contextual Hop Editing Dataset) (Context-robust Knowledge Editing) [New]
CounterFact (Standard Knowledge Editing)

Metrics:

Edit Success Rate (ES)
Paraphrase Success Rate (PS)
Neighborhood Score (NS) (preservation of local knowledge)
Generation Quality (fluency/consistency)
Statistical methodology: Not explicitly reported in the paper
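
The summary does not define how Edit Success Rate is computed. The sketch below assumes the CounterFact-style convention, in which an edit counts as successful when the model assigns higher probability to the new object o* than to the original object o. That convention, the function names, and the toy probabilities are all assumptions for illustration.

```python
from typing import Sequence


def edit_success_rate(p_new: Sequence[float], p_old: Sequence[float]) -> float:
    """Percentage of prompts where P(o*) > P(o) (assumed success criterion)."""
    if len(p_new) != len(p_old):
        raise ValueError("p_new and p_old must have the same length")
    if len(p_new) == 0:
        raise ValueError("at least one prompt is required")
    wins = sum(1 for new, old in zip(p_new, p_old) if new > old)
    return 100.0 * wins / len(p_new)


def context_gap(es_no_context: float, es_with_context: float) -> float:
    """Points of edit success lost when a prefix context is prepended."""
    return es_no_context - es_with_context


# Toy probabilities for four prompts (not results from the paper).
es = edit_success_rate([0.6, 0.2, 0.9, 0.7], [0.3, 0.5, 0.1, 0.2])
```

Computing the same rate once on bare prompts and once on prefixed prompts, then passing both to `context_gap`, gives the context-free versus contextual gap discussed in the highlights.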

Key Results

Performance on CHED benchmark showing CoRE's robustness against distractive hop-word contexts compared to baselines.

Benchmark	Metric	Baseline	This Paper	Δ
CHED	Edit Success Rate (ES)	69.1	86.3	+17.2
CHED	Edit Success Rate (ES)	78.1	95.6	+17.5
CHED	Edit Success Rate (ES)	56.4	86.3	+29.9

General capability retention results showing CoRE does not degrade model performance.

Benchmark	Metric	Baseline	This Paper	Δ
COPA (Downstream Task)	Accuracy	0.77	0.78	+0.01

Experiment Figures

Analysis of hidden state variance. Left: Variance of value vectors with different prefix types. Right: Reduction in pairwise L2 distance of hidden states with CoRE.

Impact of different hop word selection criteria on Edit Success.

Main Takeaways

Preceding contexts containing 'hop words' (entities related to original knowledge) significantly degrade the success rate of standard editing methods (MEMIT, ROME).
CoRE consistently restores edit success rates in the presence of these distractors without sacrificing performance on standard context-free evaluations.
The 'Freq-Sim' strategy (selecting low-frequency, high-similarity words) is most effective for identifying potent distractors for the benchmark.
The method preserves general language capabilities (COPA, Math, etc.) at a level comparable to or slightly better than MEMIT.
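
To make the 'Freq-Sim' idea concrete, the sketch below keeps only low-frequency candidates and ranks them by cosine similarity to an anchor embedding. The summary only says that the strategy selects low-frequency, high-similarity words. The hard frequency threshold, the function names, and the toy frequencies and embeddings are hypothetical, and the paper's actual scoring rule may differ.

```python
import numpy as np


def cosine(u, v) -> float:
    """Cosine similarity between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def select_hop_words(candidates, anchor_vec, max_freq, top_k):
    """Pick the top_k rare candidates most similar to the anchor.

    candidates: list of (word, corpus_frequency, embedding) tuples.
    """
    rare = [
        (word, cosine(vec, anchor_vec))
        for word, freq, vec in candidates
        if freq <= max_freq                      # low-frequency filter (assumed threshold)
    ]
    rare.sort(key=lambda item: item[1], reverse=True)   # high similarity first
    return [word for word, _ in rare[:top_k]]


# Toy candidates with made-up frequencies and 2-d embeddings.
toy_candidates = [
    ("iPhone", 50, [0.9, 0.1]),
    ("company", 100_000, [0.95, 0.05]),   # similar but too frequent
    ("Cupertino", 20, [0.7, 0.7]),
    ("banana", 30, [0.0, 1.0]),           # rare but dissimilar
]
chosen = select_hop_words(toy_candidates, anchor_vec=[1.0, 0.0], max_freq=1000, top_k=2)
```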

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and Transformer architecture
Familiarity with Knowledge Editing (KE) methods, specifically locate-then-edit approaches
Basic linear algebra (for key-value memory operations in MLPs)

Key Terms

Knowledge Editing (KE): Techniques to update specific facts in an LLM (e.g., 'The president is X' -> 'The president is Y') without retraining the entire model

Locate-then-edit: A KE paradigm that identifies specific model weights responsible for a fact and directly modifies them

MEMIT: Mass-Editing Memory in a Transformer, a locate-then-edit method that updates many facts simultaneously by modifying MLP layers

Hop Words: Entities in a knowledge graph (like Wikidata) that are directly connected (one hop away) to the subject or object of a fact; used here as potent distractors

Prefix Context: Text appearing before the actual query or prompt (e.g., conversation history), which can influence the model's generation

Hidden State Variance: The degree to which the model's internal representation changes when the input context changes; CoRE aims to minimize this for the edited fact

Key-Value Memory: Interpretation of Transformer MLP layers where the input acts as a 'key' to retrieve a 'value' (fact) stored in the weights

KL Divergence: A statistical measure of how one probability distribution differs from another; used here to ensure the model doesn't change unrelated knowledge

G-Eval: A framework using LLMs (like GPT-4) to evaluate the quality or coherence of text
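
The key-value memory view and the locate-then-edit idea from the terms above can be illustrated with a small NumPy sketch of a MEMIT-style least-squares update, in which an MLP weight matrix W maps keys to values and is shifted so that the new keys map to the new values. The closed form follows the update published for MEMIT. Shapes, variable names, and the random toy data are assumptions, and CoRE's additional context regularizer is not included.

```python
import numpy as np


def memit_style_update(W0, K1, V1, C0):
    """Return W0 + delta with delta = R K1^T (C0 + K1 K1^T)^{-1}, R = V1 - W0 K1.

    W0: (d_v, d_k) original weights; K1: (d_k, n) new keys as columns;
    V1: (d_v, n) target values; C0: (d_k, d_k) covariance of preserved keys.
    """
    R = V1 - W0 @ K1                  # residual the edit must absorb
    A = C0 + K1 @ K1.T                # symmetric (d_k, d_k) system matrix
    # Solve delta @ A = R @ K1.T without forming an explicit inverse.
    delta = np.linalg.solve(A, (R @ K1.T).T).T
    return W0 + delta


rng = np.random.default_rng(0)
d_k, d_v = 8, 5
W0 = rng.normal(size=(d_v, d_k))
K0 = rng.normal(size=(d_k, 50))       # keys whose mappings should be preserved
K1 = rng.normal(size=(d_k, 3))        # keys for the facts being edited
V1 = rng.normal(size=(d_v, 3))        # target values for the edited facts

W1 = memit_style_update(W0, K1, V1, C0=K0 @ K0.T)
err_before = np.linalg.norm(W0 @ K1 - V1)
err_after = np.linalg.norm(W1 @ K1 - V1)
```

A larger C0 keeps the weights closer to their original values at the cost of a less exact edit. As C0 shrinks toward zero, the edited keys map almost exactly to their target values.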