RelEdit: Evaluating Conceptual Knowledge Editing in Language Models via Relational Reasoning

📝 Paper Summary

Knowledge Editing Conceptual Knowledge Benchmarks

RelEdit is a benchmark for conceptual knowledge editing that evaluates not just whether a concept's definition changes, but also whether related instances and hierarchical concept relationships update consistently.

Core Problem

Current knowledge editing evaluations focus on checking if a specific concept definition is updated but fail to assess whether the model correctly propagates these changes to related instances and conceptual hierarchies.

Why it matters:

Changing an abstract concept (e.g., 'Gender') should logically alter the classification of specific instances (e.g., 'Non-binary') and relationships with other concepts (e.g., 'Psychology')
Existing methods like ROME and MEMIT optimize for local factual updates and often fail to reason about the broader ontological consequences of a conceptual edit

Concrete Example: If the concept 'Gender' is edited from a biological binary to a spectrum including psychological identity, the model should correctly classify 'Non-binary' as an instance of Gender and link Gender to Psychology. Current models might update the definition text but fail these relational checks.

Key Novelty

RelEdit Benchmark & MICE Baseline

Introduces 5 novel metrics (Instance Change, Portability, Instance Locality, Alignment Belong, Alignment Compare) to test whether an edit propagates to instances and superclasses derived from the DBpedia ontology
Proposes MICE (Memory-based In-Context Editing), a retrieval-based method that stores edits in external memory and prompts the model to reason about contradictions, avoiding direct parameter updates

Architecture

The workflow of MICE (Memory-based In-Context Editing)

Evaluation Highlights

Existing parametric methods (ROME, MEMIT) struggle with relational reasoning, scoring as low as 0.22-3.10 on Instance Change (IC) despite high reliability on direct definition recall
MICE outperforms parametric editors significantly on relational metrics, achieving ~92-93% Reliability and strong instance-level consistency (Instance Change ~18%) compared to near-zero for baselines
Larger models (e.g., Mistral-7B) generally handle relational reasoning challenges better than smaller models (e.g., GPT-2 XL) across all editing methods

Breakthrough Assessment

7/10

Identifies a critical gap in knowledge editing (conceptual ripple effects) and provides a rigorous benchmark. The proposed baseline (MICE) is simple but effective, though the low absolute scores on some metrics show the problem is far from solved.

⚙️ Technical Details

Problem Definition

Setting: Conceptual Knowledge Editing where a concept C=(c,d) is updated to C*=(c,d*), requiring consistency across related instances and hierarchy

Inputs: Edit request (concept name c, new definition d*), and test queries about related instances/concepts

Outputs: Binary classification (Yes/No) on whether relationships hold true under the new definition
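
The setting above can be made concrete with a small data model. This is a minimal sketch, not the paper's code: the class names `Concept`, `EditRequest`, and `RelationalQuery`, the helper `apply_edit`, and the wording of the example definitions are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Concept:
    """A concept C = (c, d): a name c paired with a definition d."""
    name: str
    definition: str


@dataclass(frozen=True)
class EditRequest:
    """An edit request: the concept name c and the new definition d*."""
    name: str
    new_definition: str


@dataclass(frozen=True)
class RelationalQuery:
    """A yes/no test query about a related instance or concept."""
    question: str
    expected: str  # "Yes" or "No" under the new definition


def apply_edit(concept: Concept, edit: EditRequest) -> Concept:
    """Return C* = (c, d*): the name is kept, only the definition changes."""
    if concept.name != edit.name:
        raise ValueError("edit targets a different concept")
    return replace(concept, definition=edit.new_definition)


# Illustrative values based on the 'Gender' example in this summary.
gender = Concept("Gender", "a biological binary")
edit = EditRequest("Gender", "a spectrum including psychological identity")
edited = apply_edit(gender, edit)
query = RelationalQuery("Is Non-binary an instance of Gender?", expected="Yes")
```

Keeping the concept immutable makes it easy to hold both C and C* side by side when building test queries that depend on the difference between the two definitions.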

Pipeline Flow

Edit Request Generation (DBpedia extraction)
MICE Retrieval (for baseline)
Conflict Detection (MICE)
Answer Generation

System Modules

Ontology Builder (Data Construction)

Extracts classes, instances, and superclasses from DBpedia; filters classes by instance count

Model or implementation: SPARQL queries on DBpedia
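
The extraction step might look roughly like the sketch below, which only builds query text and applies a count filter. The paper's actual SPARQL queries are not reproduced in this summary, so the query shapes, the helper names, the `Mammal` and `Volcano` counts, and the threshold are assumptions; only the standard `rdf:type` and `rdfs:subClassOf` predicates and the DBpedia ontology namespace are taken as given.

```python
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
)


def instances_query(class_name: str, limit: int = 100) -> str:
    """SPARQL text listing up to `limit` resources typed as dbo:<class_name>."""
    return (
        PREFIXES
        + f"SELECT ?instance WHERE {{ ?instance rdf:type dbo:{class_name} }} "
        + f"LIMIT {limit}"
    )


def superclass_query(class_name: str) -> str:
    """SPARQL text asking for the direct superclasses of dbo:<class_name>."""
    return (
        PREFIXES
        + f"SELECT ?super WHERE {{ dbo:{class_name} rdfs:subClassOf ?super }}"
    )


def keep_well_populated(counts: dict[str, int], min_instances: int) -> list[str]:
    """Keep classes with at least `min_instances` instances (hypothetical filter)."""
    return sorted(name for name, n in counts.items() if n >= min_instances)


# Made-up counts and threshold, purely to show the filter's behaviour.
kept = keep_well_populated({"Mammal": 500, "Volcano": 3}, min_instances=10)
```

The query strings would be sent to a SPARQL endpoint by whatever client the pipeline uses; that call is left out here so the sketch stays runnable offline.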

Edit Generator (Data Construction)

Pairs concepts with target definitions from Intra (same superclass) or Inter (different superclass) sources

Model or implementation: Rules + GPT-4 for paraphrasing
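
The Intra/Inter pairing rule can be sketched as follows. The toy ontology, the function names, and the random choice of a source concept are assumptions for illustration; the GPT-4 paraphrasing step is omitted entirely.

```python
import random

# Toy ontology: concept -> (superclass, definition). All entries are illustrative.
ONTOLOGY = {
    "Cat": ("Mammal", "a small domesticated feline"),
    "Dog": ("Mammal", "a domesticated canine"),
    "Oak": ("Plant", "a tree that bears acorns"),
}


def candidate_sources(concept: str, setting: str) -> list[str]:
    """Concepts whose definition may serve as the edit target.

    "intra" keeps siblings under the same superclass; "inter" keeps concepts
    under a different superclass.
    """
    if setting not in ("intra", "inter"):
        raise ValueError("setting must be 'intra' or 'inter'")
    parent = ONTOLOGY[concept][0]
    want_same = setting == "intra"
    return sorted(
        other
        for other, (superclass, _) in ONTOLOGY.items()
        if other != concept and (superclass == parent) == want_same
    )


def make_edit(concept: str, setting: str, rng: random.Random) -> tuple[str, str]:
    """Return (concept name, target definition borrowed from a source concept)."""
    source = rng.choice(candidate_sources(concept, setting))
    return concept, ONTOLOGY[source][1]


intra_edit = make_edit("Cat", "intra", random.Random(0))
inter_edit = make_edit("Cat", "inter", random.Random(0))
```

In this toy data an Intra edit gives 'Cat' the definition of its sibling 'Dog', while an Inter edit gives it the definition of 'Oak', which sits under a different superclass.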

Memory Retriever (MICE Baseline)

Retrieves stored edited concept definitions relevant to the query

Model or implementation: Dense Retriever (e.g., Contriever/BERT-based)

Reasoning Engine (MICE Baseline)

Checks for contradictions between memory and internal knowledge, then generates answer

Model or implementation: LLM (e.g., Mistral-7B)
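
The two MICE modules above can be sketched as a memory store plus a prompt builder. This is a rough illustration under stated assumptions: token overlap stands in for the dense retriever, the prompt wording is invented here rather than taken from the paper, and the call to the LLM is left out.

```python
import re
from typing import Optional


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class EditMemory:
    """External store of edited definitions, keyed by concept name."""

    def __init__(self) -> None:
        self._edits: dict[str, str] = {}

    def write(self, concept: str, new_definition: str) -> None:
        self._edits[concept] = new_definition

    def retrieve(self, query: str) -> Optional[tuple[str, str]]:
        """Return the stored (concept, definition) sharing the most tokens with
        the query, or None if nothing overlaps.

        Token overlap is a simple stand-in for a dense retriever.
        """
        query_tokens = tokens(query)
        best, best_score = None, 0
        for concept, definition in self._edits.items():
            score = len(query_tokens & tokens(concept + " " + definition))
            if score > best_score:
                best, best_score = (concept, definition), score
        return best


def build_prompt(query: str, memory: EditMemory) -> str:
    """Prepend the retrieved edit and a contradiction instruction, if any."""
    hit = memory.retrieve(query)
    if hit is None:
        return f"Question: {query}\nAnswer Yes or No."
    concept, definition = hit
    return (
        f"Updated definition of {concept}: {definition}\n"
        "If this conflicts with what you previously knew, "
        "follow the updated definition.\n"
        f"Question: {query}\nAnswer Yes or No."
    )


memory = EditMemory()
memory.write("Gender", "a spectrum including psychological identity")
prompt = build_prompt("Is Non-binary an instance of Gender?", memory)
```

Because the edit lives only in `EditMemory`, removing or revising it never touches model weights, which is the property that separates this style of editing from ROME or MEMIT.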

Novel Architectural Elements

Benchmark design: Hierarchical evaluation metrics (AB, AC) based on ontology structure
MICE Pipeline: Explicit memory retrieval + contradiction checking prompt for concept editing without parameter updates

Modeling

Base Model: Evaluated on GPT2-XL (1.5B), GPT-J (6B), LLaMA-2-7B, Mistral-7B-v0.1

Training Method: MICE (In-Context Learning + Retrieval) vs. Parametric Baselines (ROME, MEMIT, MEND)

Compute: Single A800 GPU for experiments

Comparison to Prior Work

vs. ConceptEdit: RelEdit adds relational reasoning metrics (instances, superclasses) rather than just definition checking
vs. RIPPLEEDITS: RelEdit focuses on abstract concepts (ontology) rather than concrete entities (facts)
vs. ROME/MEMIT: Paper shows these methods fail at relational reasoning despite success on factual editing benchmarks

Limitations

Does not verify part-whole relationships (e.g., if 'wheel' changes, how 'car' changes)
Excludes polysemous concepts (concepts with multiple meanings)
Relies on DBpedia which may be incomplete or outdated
Evaluated only on a limited set of models (up to 7B parameters)

Reproducibility

Code: https://github.com/ivanniu/RelEdit

The code is publicly available. Dataset construction details (SPARQL queries, filtering) are provided, and MICE implementation details (prompts) are included.

📊 Experiments & Results

Evaluation Setup

Zero-shot relational reasoning queries after applying a single concept edit

Benchmarks:

RelEdit (Conceptual Knowledge Editing Evaluation) [New]

Metrics:

Reliability (Re)
Generalization (Ge)
Locality (Lo)
Instance Change (IC)
Portability (PO)
Instance Locality (IL)
Alignment Belong (AB)
Alignment Compare (AC)

Statistical methodology: Not explicitly reported in the paper
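
This summary does not give the scoring formula, so the sketch below assumes that each relational metric is the percentage of yes/no queries answered as expected. The function names and the sample records are hypothetical and are not results from the paper.

```python
import re
from collections import defaultdict


def normalize(answer: str) -> str:
    """Map free-form model output to "yes", "no", or "unknown"."""
    match = re.match(r"\s*(yes|no)\b", answer.lower())
    return match.group(1) if match else "unknown"


def score_by_metric(records: list[tuple[str, str, str]]) -> dict[str, float]:
    """Percentage of correct answers per metric.

    Each record is (metric, model_answer, expected_answer). An answer that
    cannot be read as yes or no counts as wrong.
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for metric, predicted, expected in records:
        totals[metric] += 1
        label = normalize(predicted)
        if label != "unknown" and label == normalize(expected):
            hits[metric] += 1
    return {metric: 100.0 * hits[metric] / totals[metric] for metric in totals}


# Made-up answers, only to show the bookkeeping.
records = [
    ("IC", "Yes", "Yes"),
    ("IC", "No.", "Yes"),
    ("AB", "yes, it is", "Yes"),
    ("AB", "No", "No"),
]
scores = score_by_metric(records)
```

The word-boundary match in `normalize` matters for this benchmark's vocabulary: an answer beginning with "Non-binary" is not misread as "no".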

Key Results

Comparison of MICE against parametric editing methods (ROME, MEMIT) and PROMPT on Mistral-7B (Intra Setting).

Benchmark	Metric	Baseline	This Paper	Δ
RelEdit (Intra)	Instance Change (IC)	16.59	6.23	-10.36
RelEdit (Intra)	Portability (PO)	65.04	95.52	+30.48
RelEdit (Intra)	Alignment Belong (AB)	87.17	90.68	+3.51
RelEdit (Inter)	Alignment Compare (AC)	36.28	95.64	+59.36

Performance of existing methods (ROME) on GPT2-XL vs Mistral-7B (Intra Setting) showing model size impact.

Benchmark	Metric	GPT2-XL	Mistral-7B	Δ
RelEdit (Intra)	Reliability (Re)	86.47	96.47	+10.00
RelEdit (Intra)	Alignment Belong (AB)	76.77	95.58	+18.81

Experiment Figures

Average scores of Mistral-7B across editing methods comparing Intra vs Inter settings

Concept consistency analysis using GPT-4 evaluation on LLaMA-2-7B

Main Takeaways

Current parametric editing methods (ROME, MEMIT) are effective at updating a concept's definition but fail to propagate the change to related instances or superclasses (low consistency).
MICE (Memory-based In-Context Editing) achieves the best overall scores, suggesting that explicit retrieval and reasoning are currently superior to weight updates for conceptual consistency.
Intra-setting edits (within same superclass) generally yield higher consistency scores than Inter-setting edits, likely because the model's pre-existing knowledge structure supports the change.
Instance-level evaluations reveal that while models may accept a new definition textually, they struggle to apply that definition to reclassify concrete examples.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Editing (ROME, MEMIT)
Ontologies (Concepts, Instances, Superclasses)
In-Context Learning
Retrieval-Augmented Generation (RAG)

Key Terms

Conceptual Knowledge: Abstract understanding of categories, principles, and relationships (e.g., 'Gender'), distinct from concrete facts (e.g., 'Paris is in France')

Ontology: A structured set of concepts and categories showing their properties and the relations between them

DBpedia: A project that extracts structured content from the information created in the Wikipedia project

Intra-setting: Editing a concept where the target definition comes from a sibling concept under the same superclass (minor semantic shift)

Inter-setting: Editing a concept where the target definition comes from a different superclass entirely (major semantic shift)

Instance Change (IC): Metric checking if instances of the original concept are correctly reclassified under the edited concept

Portability (PO): Metric checking if instances of the target definition's original category are now accepted under the edited concept

Alignment Belong (AB): Metric checking if the edited concept is correctly classified under its new superclass

MICE: Memory-based In-Context Editing, the paper's proposed baseline that uses retrieval and prompting instead of weight updates