TAXI: Evaluating Categorical Knowledge Editing for Language Models

📝 Paper Summary

Knowledge Editing Evaluation Benchmarks

TAXI is a new benchmark that evaluates whether editing a subject's category in a language model correctly updates its associated properties, revealing that current editors struggle with consistency compared to humans.

Core Problem

Current knowledge editing benchmarks fail to adequately evaluate 'consistency'—whether injecting a new fact (e.g., category change) correctly propagates to related facts (e.g., inherited properties).

Why it matters:

Effective model editing requires more than just recalling the specific edited fact; it must maintain a consistent worldview by updating logical consequences of that fact
Existing benchmarks like CounterFact or MQuAKE focus on paraphrases or multi-hop retrieval but lack clear ground truth for broad property inheritance
Inconsistent edits can lead to contradictory model outputs, where a model believes a subject belongs to a new category but still attributes old, conflicting properties to it

Concrete Example: If a model is edited to believe a 'cobra' is a 'dog', it should consistently infer that the cobra now 'barks' and 'has fur'. Current editors might successfully change the category to dog but still predict the cobra 'has scales' or 'slithers'.

Key Novelty

Taxonomic Inference (TAXI) Benchmark

Leverages the hierarchical nature of taxonomic categories (e.g., Animals, Vehicles) where category membership strictly entails specific properties
Evaluates 'Categorical Consistency': measuring if an edit to a subject's category (e.g., pitbull → cat) causes the subject to inherit the new category's properties (e.g., meows) and lose the old ones
Introduces a controlled dataset of 11,120 queries where ground truth property changes are unambiguous, enabling precise measurement of edit propagation
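
The split between properties that should change and properties that should stay can be sketched with a small data structure. This is a minimal illustration, not the TAXI data format: the class `CategoricalEdit`, the helper `split_properties`, and the toy taxonomy (including the assumption that both categories share "fur") are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CategoricalEdit:
    subject: str
    old_category: str
    new_category: str


# Hypothetical toy taxonomy: category -> {attribute: value}. Not TAXI data.
TOY_TAXONOMY: Dict[str, Dict[str, str]] = {
    "dog": {"sound": "barks", "covering": "fur"},
    "cat": {"sound": "meows", "covering": "fur"},
}


def split_properties(
    edit: CategoricalEdit, taxonomy: Dict[str, Dict[str, str]]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (changed, shared) properties of the new category.

    changed: values the subject should newly inherit (tested by consistency).
    shared:  values identical in both categories (tested by invariance).
    """
    old = taxonomy[edit.old_category]
    new = taxonomy[edit.new_category]
    changed = {a: v for a, v in new.items() if old.get(a) != v}
    shared = {a: v for a, v in new.items() if old.get(a) == v}
    return changed, shared


edit = CategoricalEdit(subject="pitbull", old_category="dog", new_category="cat")
changed, shared = split_properties(edit, TOY_TAXONOMY)
```

For the pitbull → cat edit, `changed` holds the property the subject should newly acquire and `shared` holds the one that should be left alone.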

Architecture

Conceptual diagram of the TAXI benchmark task structure.

Evaluation Highlights

Human annotators achieve 86.8% consistency on the TAXI task, whereas the best model editor (ICE) achieves only ~45% consistency on changed properties
Editors like ROME and In-Context Editing (ICE) successfully update the category label (>90% success) but fail to consistently update the associated properties
Consistency is generally higher for 'atypical' subjects (rare entities) than for typical ones, supporting recent findings that rare knowledge is easier to edit

Breakthrough Assessment

7/10

Offers a crucial, taxonomy-based metric (consistency via property inheritance) that exposes a major failure mode in current editing methods. The gap between edit success and property consistency is a significant finding.

⚙️ Technical Details

Problem Definition

Setting: Knowledge editing of Causal Language Models

Inputs: A base model f, a subject s, and a new target category c*

Outputs: An updated model f* that associates s with c* and correctly entails properties p belonging to c*

Pipeline Flow

Editor Function (applies edit s -> c*)
Evaluation (Standard/Reverse Queries)
Metric Calculation (Consistency/Invariance)
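
The three stages above can be sketched end to end with a toy stand-in for the model. Everything here is hypothetical: the "model" is a plain lookup table, `label_only_editor` is an invented editor that rewrites only the category fact, and "snake" as the cobra's original category is an assumption. The point is only to show how an edit can succeed on the category label while leaving properties stale.

```python
from typing import Dict, Tuple

# Toy "model": a lookup table from (subject, attribute) to an answer.
ToyModel = Dict[Tuple[str, str], str]


def label_only_editor(model: ToyModel, subject: str, new_category: str) -> ToyModel:
    """Stage 1 (hypothetical editor): overwrite only the category fact."""
    edited = dict(model)  # copy so the base model is left untouched
    edited[(subject, "category")] = new_category
    return edited


def evaluate(model: ToyModel, subject: str, expected: Dict[str, str]) -> Dict[str, bool]:
    """Stage 2: query each attribute and compare with the expected answer."""
    return {attr: model.get((subject, attr)) == ans for attr, ans in expected.items()}


base: ToyModel = {
    ("cobra", "category"): "snake",   # assumed original category
    ("cobra", "covering"): "scales",
}

edited = label_only_editor(base, "cobra", "dog")

# Stage 3: what a fully consistent edit should produce.
results = evaluate(edited, "cobra", {"category": "dog", "covering": "fur"})
```

With this editor the category query passes and the property query fails, which mirrors the gap between edit success and consistency that the benchmark is designed to measure.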

System Modules

Editor Function

Apply the categorical edit to the base model

Model or implementation: Llama-2-7B (Base Model)

Property Evaluator

Query the edited model on properties entailed by the new category

Model or implementation: Edited Model f*

Novel Architectural Elements

TAXI Dataset construction: 976 edits spanning 41 categories, explicitly pairing subjects with typical/atypical status and defining shared vs. unshared properties for consistency metrics

Modeling

Base Model: Llama-2-7B

Training Method: Finetuning (FT) and Rank-One Model Editing (ROME) applied as weight-based editing methods; In-Context Editing (ICE) changes only the prompt and involves no weight update

Objective Functions:

Purpose: Minimize the negative log-likelihood of the target category given the subject in the edit prompt.

Formally: Standard causal language modeling loss on the edit example.

Adaptation: Rank-One Model Editing (ROME) updates specific MLP layers; Finetuning (FT) updates weights via gradient descent
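
As a rough illustration of the objective, the loss on an edit example is the negative log-likelihood of the target category tokens given the edit prompt. The sketch below assumes the per-token probabilities have already been read off a model; the function name `target_nll` is hypothetical, and summing over tokens (rather than averaging) is an assumption.

```python
import math
from typing import Sequence


def target_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of the target tokens.

    token_probs[t] is assumed to be p(target token t | edit prompt, earlier
    target tokens), as produced by a causal language model.
    """
    return -sum(math.log(p) for p in token_probs)


# A two-token target where each token has probability 0.5.
loss_before = target_nll([0.5, 0.5])
# After a successful edit the target tokens should be more likely.
loss_after = target_nll([0.9, 0.95])
```

A lower value means the model assigns more probability to the new category, which is what the weight-based editors optimize for.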

Training Data:

TAXI Dataset: 41 categories, 164 subjects, 183 properties
Covariance statistics for ROME computed using 50,000 samples from Wikipedia

Key Hyperparameters:

covariance_samples: 50,000 (for ROME)

Compute: Single Nvidia A100 GPU

Comparison to Prior Work

vs. CounterFact: TAXI tests property inheritance (entailment), not just paraphrase robustness
vs. MQuAKE: TAXI uses taxonomic categories where ground truth is strictly defined, avoiding ambiguous multi-hop chains
vs. RippleEdits: TAXI focuses specifically on categorical edits and entailed properties, providing a cleaner test of 'worldview' consistency

Limitations

Dataset manually created, limited to 41 categories and concrete/everyday objects
Only evaluates Llama-2-7B; performance on larger or different architectures is unknown
Reverse queries (e.g., 'A type of cat is a...') fail for FT and ROME because of the left-to-right nature of causal language models, which limits the evaluation scope
Limited to single-hop property inheritance, not complex reasoning chains

Reproducibility

Code: https://github.com/derekpowell/taxi

Code and data are publicly available at https://github.com/derekpowell/taxi. ROME and FT implementations utilize the EasyEdit library. ICE implementation is described as prepending 'Imagine that a <subject> was a kind of <category> ...' to prompts.
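
The ICE prompt construction described above could look roughly like the following. This is a sketch only: the function `build_ice_prompt` is hypothetical, and because the template quoted above ends in an ellipsis, the remainder is left to the caller through the `template_rest` argument rather than guessed.

```python
def build_ice_prompt(subject: str, category: str, query: str, *, template_rest: str) -> str:
    """Prepend the in-context edit statement to a query.

    template_rest stands in for the part of the template that is elided
    above; it must be supplied by the caller.
    """
    prefix = f"Imagine that a {subject} was a kind of {category}{template_rest}"
    return f"{prefix} {query}"


# Illustrative call; "." is a placeholder for the elided remainder.
prompt = build_ice_prompt("cobra", "dog", "A cobra has", template_rest=".")
```

No model weights are touched; the edit lives entirely in the text placed before the query.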

📊 Experiments & Results

Evaluation Setup

Apply each edit (subject -> new category), then query the edited model on properties. Answers are scored by comparing the edited model's probabilities over the options of multiple-choice questions.
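
Scoring a multiple-choice query by model probability can be sketched as follows, assuming a log-probability has already been obtained for each answer option. The helper names and the example numbers are hypothetical, and the option "hisses" is invented toy data.

```python
import math
from typing import Dict


def normalize(option_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the answer options so their probabilities sum to 1."""
    top = max(option_logprobs.values())  # subtract the max for numerical stability
    exps = {opt: math.exp(lp - top) for opt, lp in option_logprobs.items()}
    total = sum(exps.values())
    return {opt: e / total for opt, e in exps.items()}


def pick_option(option_logprobs: Dict[str, float]) -> str:
    """Return the option the model considers most likely."""
    return max(option_logprobs, key=option_logprobs.get)


# Made-up scores for "What sound does a cobra make?" after a cobra -> dog edit.
scores = {"barks": -1.2, "hisses": -0.4, "meows": -3.0}
choice = pick_option(scores)
probs = normalize(scores)
is_correct = choice == "barks"  # the new category's property is the target
```

In this toy case the model still prefers the old property, so the query counts as a consistency failure.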

Benchmarks:

TAXI (Categorical Knowledge Editing / Property Entailment) [New]

Metrics:

Edit Success (ES)
Consistency (accuracy on properties that should change)
Invariance (accuracy on properties that should NOT change)
Statistical methodology: Not explicitly reported in the paper
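
Given per-query outcomes, the two property metrics reduce to accuracy over two disjoint subsets of queries. The sketch below assumes each query is tagged as "changed" or "shared"; the `QueryResult` class, the tag names, and the example outcomes are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryResult:
    kind: str      # "changed" (should update) or "shared" (should stay)
    correct: bool  # whether the edited model chose the expected answer


def accuracy(results: List[QueryResult], kind: str) -> float:
    """Fraction of correct answers among queries of the given kind."""
    subset = [r.correct for r in results if r.kind == kind]
    return sum(subset) / len(subset) if subset else float("nan")


# Made-up outcomes for one edit.
results = [
    QueryResult("changed", True),
    QueryResult("changed", False),
    QueryResult("shared", True),
    QueryResult("shared", True),
]

consistency = accuracy(results, "changed")
invariance = accuracy(results, "shared")
```

Keeping the two subsets separate matters because an editor that changes nothing would score perfectly on invariance while failing consistency entirely.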

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Edit Success (ES) measures if the model accepts the new category label. All editors perform well here.
TAXI	Edit Success (ES)	0.00	0.99	+0.99
TAXI	Edit Success (ES)	0.00	0.91	+0.91
Consistency measures if the model updates properties implied by the new category. Editors struggle significantly compared to humans.
TAXI	Consistency	0.87	0.45	-0.42
TAXI	Consistency	0.25	0.38	+0.13
TAXI	Consistency	0.25	0.26	+0.01
Invariance measures if the model preserves properties shared by both old and new categories. Editors generally preserve these well.
TAXI	Invariance	0.80	0.76	-0.04
TAXI	Invariance	0.80	0.80	0.00

Experiment Figures

Bar charts comparing Consistency and Invariance scores across FT, ROME, ICE, and Human baselines.

Performance on Reverse Queries (e.g., 'A type of dog is a [cobra]').

Main Takeaways

Edit Success does not imply Property Success: Models readily accept 'X is a Y' but fail to infer 'X therefore has property Z' (Consistency).
Property Invariance (keeping shared properties) is much stronger than Consistency (updating unique properties) across all editors.
ICE (In-Context Editing) generally outperforms ROME and FT on consistency metrics, but heavily relies on the prompt context.
Atypical subjects (rare entities) are easier to edit consistently than typical subjects, suggesting prior knowledge interferes with updates.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Editing techniques (ROME, Finetuning)
Taxonomic reasoning / Property inheritance
Causal Language Models (e.g., Llama-2)

Key Terms

Categorical Consistency: A metric measuring the proportion of properties that correctly change to match a subject's newly assigned category after an edit

Invariance: A metric measuring the proportion of properties that correctly remain unchanged after an edit because they are shared by both the old and new categories

ROME: Rank-One Model Editing—a method that locates and alters specific factual associations in a model's MLP weights by treating them as key-value pairs

ICE: In-Context Editing—a method that prepends the desired edit (e.g., 'Imagine that a cobra is a dog') to the prompt context rather than changing model weights

EasyEdit: An open-source software framework used to implement and evaluate various knowledge editing methods

FT: Finetuning—updating model weights via gradient descent on the specific edit example

Taxonomy: A hierarchical classification system (e.g., Animal -> Dog -> Labrador) where lower levels inherit properties from higher levels

Superordinate category: A high-level category grouping (e.g., 'Animals', 'Vehicles') that contains the specific categories used in the edits

CounterFact: A prior dataset for knowledge editing that focuses on counterfactual updates but lacks the specific property inheritance structure of TAXI

MQuAKE: Multi-hop Question Answering for Knowledge Editing—a benchmark evaluating if edits propagate through multi-hop reasoning chains

RippleEdits: A benchmark measuring the 'ripple effects' of edits, checking if related facts update consistently with the primary edit