Can Knowledge Editing Really Correct Hallucinations?

📝 Paper Summary

Knowledge Editing Evaluation Hallucination Correction

HalluEditBench evaluates knowledge editing methods on pre-verified hallucinations rather than generic facts, revealing that techniques often fail to correct actual errors or degrade model capabilities in unexpected ways.

Core Problem

Existing knowledge editing benchmarks (like ZsRE, WikiBio) do not verify if the model actually hallucinates the target facts before editing, leading to distorted effectiveness scores.

Why it matters:

If a model already knows a fact (high pre-edit accuracy), applying an edit to 'correct' it is a false premise for evaluating hallucination correction
High post-edit scores on previous benchmarks disguise the true failure rates of editing methods when applied to real-world errors
Side effects like damaged reasoning (portability) or susceptibility to user pressure (robustness) are insufficiently tested

Concrete Example: A model might already correctly answer 'Who is the president of France?' (Macron). Evaluating a method's ability to 'fix' this knowledge is meaningless. HalluEditBench filters for facts the model specifically gets wrong (e.g., answering 'Ilya Sutskever' instead of 'Jakub Pachocki' for OpenAI's Chief Scientist) to test true correction.

Key Novelty

HalluEditBench (Verified Hallucination Benchmark)

Constructs a dataset where every target fact is confirmed to be hallucinated by the specific LLM (Llama-2, Llama-3, Mistral) before any editing occurs, ensuring a strict 0% pre-edit baseline
Evaluates editing methods across five distinct facets: Efficacy (did it fix the error?), Generalization (rephrased questions), Portability (multi-hop reasoning), Locality (side effects), and Robustness (resistance to user challenge)

Architecture

The construction pipeline of HalluEditBench, illustrating the two phases: Hallucination Collection and Evaluation QA Generation.

Evaluation Highlights

Common methods like FT-M and MEMIT show ~100% efficacy on old benchmarks but drop to ~60% on verified hallucinations in HalluEditBench
Knowledge editing often harms reasoning: most methods (except ICE) lower performance on multi-hop questions (Portability) compared to the unedited model
Parameter-preserving methods (ICE, GRACE) significantly outperform parameter-modifying methods (ROME, MEMIT, FT) in correcting hallucinations (Efficacy)

Breakthrough Assessment

8/10

Crucial methodological correction for the field. It exposes that current editing methods are far less effective on *actual* errors than previously thought, likely shifting future evaluation standards.

⚙️ Technical Details

Problem Definition

Setting: Given an LLM that hallucinates a fact (s, r, o_hallucinated), apply an edit e to update the knowledge to (s, r, o*) such that the model answers o*.

Inputs: A factual question q where the model generates a hallucinated answer

Outputs: The edited model's response to q and related generalization/portability/locality questions

Pipeline Flow

Hallucination Detection (Filter Wikidata triplets where model answers incorrectly)
Dataset Construction (Generate 5 types of evaluation QA pairs)
Knowledge Editing (Apply 7 methods to fix hallucinations)
Holistic Evaluation (Measure 5 metrics: Efficacy, Generalization, Portability, Locality, Robustness)

System Modules

Hallucination Filter

Identify facts the specific LLM gets wrong to ensure 0% pre-edit accuracy

Model or implementation: Target LLM (Llama-2, Llama-3, Mistral)

Editor

Apply specific editing algorithm to correct the hallucination

Model or implementation: Various (ROME, MEMIT, FT-L, FT-M, LoRA, ICE, GRACE)

Evaluator

Assess performance across 5 dimensions using GPT-4o generated QA pairs

Model or implementation: GPT-4o (for QA generation)

Novel Architectural Elements

Constructing evaluation sets dynamically based on *model-specific* failures rather than a fixed static dataset
Robustness evaluation protocol involving multi-turn user pressure (sycophancy stress test) on edited facts

Modeling

Base Model: Llama-2-7B-Chat, Llama-3-8B-Instruct, Mistral-v0.3-7B-Instruct

Training Method: Various Knowledge Editing techniques

Adaptation: Varies by method: LoRA (rank=8), ROME/MEMIT (layer updates), FT (constrained fine-tuning)

Trainable Parameters: Varies (e.g., specific MLP layers for ROME, adapter weights for LoRA/GRACE)

Training Data:

9 domains, 26 topics
Filtered from ~143k Wikidata triplets
~2,000 verified hallucinations sampled per LLM for final benchmark

Key Hyperparameters:

LoRA_rank: 8
LoRA_alpha: 32
LoRA_dropout: 0.1
+ 3 more
FT_learning_rate: 5e-4 to 1e-5 (depending on method)
MEMIT_layers: 4, 5, 6, 7, 8 (Llama-2/3)
ROME_layer: 5 (Llama-2/3)

Compute: Experiments run on NVIDIA A6000 GPUs

Comparison to Prior Work

vs. Existing Benchmarks (ZsRE, WikiBio): HalluEditBench ensures pre-edit performance is 0% (verified hallucination), whereas others do not check if the model actually needs editing
vs. ROME/MEMIT original papers: Evaluates on 5 holistic dimensions including Robustness (multi-turn pressure) and Portability (multi-hop), not just rewrite success
vs. EasyEdit [not cited in paper]: Uses EasyEdit as the underlying library but provides a more rigorous data filtering methodology

Limitations

Evaluation relies on GPT-4o for QA pair generation, which may introduce its own biases
Does not cover all possible editing methods (selected 7 representative ones)
Focuses on factual triples, may not generalize to procedural knowledge editing
Robustness metric focuses on sycophancy/user pressure, not adversarial attacks

Reproducibility

Code: https://github.com/llm-editing/HalluEditBench

Benchmark data and code publicly available at https://github.com/llm-editing/HalluEditBench. Uses GPT-4o for evaluation data generation (closed source dependency). Editing method implementations based on EasyEdit library.

📊 Experiments & Results

Evaluation Setup

Benchmarking 7 knowledge editing methods on Llama-2-7B, Llama-3-8B, and Mistral-v0.3-7B using ~2,000 verified hallucinations per model.

Benchmarks:

HalluEditBench (Knowledge Editing / Hallucination Correction) [New]

Metrics:

Efficacy Score (%) - Correctness on the exact edited fact
Generalization Score (%) - Accuracy on rephrased/related questions
Portability Score (%) - Accuracy on multi-hop reasoning using the edited fact
Locality Score (%) - Rate of preserving answers to unrelated questions
Robustness Score (%) - Rate of insisting on the edited fact under user pressure
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Discrepancy between standard benchmarks and HalluEditBench: Methods that appear perfect on unverified datasets fail on actual hallucinations.
Existing Datasets (WikiData_recent)	Accuracy	45.03	99.88	+54.85
HalluEditBench	Efficacy Score	0.00	60.50	+60.50
Efficacy comparison across editing paradigms shows parameter-preserving methods outperform parameter-modifying ones.
HalluEditBench	Efficacy Score	65.3	95.5	+30.2
HalluEditBench	Efficacy Score	59.2	93.4	+34.2
Side effects on reasoning (Portability) and stability (Locality).
HalluEditBench	Locality Score	39.6	60.2	+20.6
HalluEditBench	Portability Score	32.0	15.0	-17.0

Experiment Figures

Efficacy Scores (%) of 7 editing methods across 9 domains for three LLMs.

Portability Scores (multi-hop reasoning accuracy) for pre-edit vs post-edit models across 1-6 hops.

Main Takeaways

Knowledge editing methods are less effective than previously claimed: when applied to verified hallucinations (0% pre-edit accuracy), 'SOTA' methods like MEMIT and FT-M often achieve only ~60% efficacy.
Parameter-preserving methods (ICE, GRACE) consistently outperform parameter-modifying methods (ROME, MEMIT) in correcting facts.
Editing damages reasoning capabilities: Most methods (except ICE) reduce the model's ability to answer multi-hop questions (Portability) compared to the unedited model.
Locality is a major weakness: Apart from ICE and FT-M, most methods have locality scores below 40%, meaning they significantly corrupt unrelated knowledge.
ICE (In-Context Editing) is the most robust and balanced method overall, though it suffers in Robustness (susceptibility to user pressure) compared to parameter updates.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Editing (updating specific facts in LLMs without retraining)
Hallucination (generation of non-factual content)
Model Editing techniques (ROME, MEMIT, LoRA, etc.)

Key Terms

Knowledge Editing: Techniques to insert or update specific factual knowledge in a trained language model without full retraining

Hallucination: When a language model generates plausible-sounding but factually incorrect information

ROME: Rank-One Model Editing—a method that locates and updates a specific factual association in a model's MLP layers

MEMIT: Mass-Editing Memory in a Transformer—an extension of ROME designed to edit thousands of facts simultaneously

ICE: In-Context Editing—providing the corrected fact directly in the prompt context rather than modifying model weights

GRACE: A memory-based editing method that uses a discrete codebook to intercept and adjust activations for specific inputs

Portability: Whether the model can use the edited knowledge to answer downstream reasoning questions (e.g., multi-hop queries)

Locality: Whether the edit remains specific to the target fact without changing answers to unrelated questions

Sycophancy: The tendency of a model to agree with the user's input, even if the input contradicts its own knowledge