Multiplicative Orthogonal Sequential Editing for Language Models

📝 Paper Summary

Knowledge Editing, Model Editing

MOSE replaces traditional additive parameter updates with multiplicative orthogonal transformations to preserve numerical stability and model performance during sequential knowledge editing.

Core Problem

Existing sequential editing methods use additive updates that progressively degrade the numerical stability (norm and condition number) of parameter matrices, damaging the model's general abilities.

Why it matters:

Sequential editing is essential for keeping LLMs up-to-date without expensive retraining, but current methods degrade rapidly after multiple updates
Loss of numerical stability leads to 'catastrophic forgetting' where the model loses general capabilities and reasoning skills while learning new facts

Concrete Example: After 4000 sequential edits on LLaMA3-8B using additive methods like ROME, the matrix condition number spikes, causing the model to fail on both retained knowledge and downstream tasks like summarization.

Key Novelty

Multiplicative Orthogonal Sequential Editing (MOSE)

Instead of adding a delta matrix (W + ΔW), MOSE left-multiplies the original weights by an orthogonal matrix (R * W), which mathematically preserves vector lengths and angles
Solves the update as an 'Orthogonal Procrustes Problem' to find the optimal rotation that aligns new knowledge while keeping old knowledge stable
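
The invariance claimed above follows from a short calculation. The sketch below assumes R is square with R^T R = I and that the condition number is the 2-norm condition number κ = σ_max / σ_min; the symbols x, y, σ_i, and κ are introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\|Rx\|_2^2 &= x^\top R^\top R\, x = \|x\|_2^2, \qquad
\langle Rx, Ry \rangle = x^\top R^\top R\, y = \langle x, y \rangle, \\
\|RW\|_F^2 &= \operatorname{tr}\!\left(W^\top R^\top R\, W\right)
            = \operatorname{tr}\!\left(W^\top W\right) = \|W\|_F^2, \\
(RW)^\top (RW) &= W^\top W
  \;\Longrightarrow\; \sigma_i(RW) = \sigma_i(W)\ \text{for all } i
  \;\Longrightarrow\; \kappa(RW) = \kappa(W).
\end{aligned}
```

Because every singular value of W is unchanged by the left multiplication, repeating the update any number of times leaves both the Frobenius norm and the condition number where they started, up to floating-point rounding.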

Architecture

Comparison of Additive vs. Multiplicative editing paradigms.

Evaluation Highlights

+12.08% improvement in sequential editing performance compared to state-of-the-art baselines across LLaMA3-8B and Qwen2.5-7B
Retains 95.73% of general abilities on downstream tasks (like NLI and summarization) after extensive editing, significantly outperforming additive methods
Strictly maintains the Frobenius norm and condition number of parameter matrices even after 4000 sequential edits

Breakthrough Assessment

8/10

Offers a mathematically grounded departure from the dominant additive editing paradigm, effectively solving the stability issues that plague continuous model updating.

⚙️ Technical Details

Problem Definition

Setting: Sequential knowledge editing where a model f_θ is updated iteratively with a stream of edit pairs (xe, ye)

Inputs: A sequence of knowledge triplets or pairs (subject, relation, object) to be inserted

Outputs: An updated model parameter set θ' that outputs target ye for input xe while preserving behavior on unrelated inputs

Pipeline Flow

Layer Selection (Identify target layer l*)
Key-Value Computation (Determine current k and target v)
Orthogonal Optimization (Solve for R)
Parameter Update (W' = R * W)

System Modules

Layer Selector

Identify the most responsive layer for the specific knowledge to be edited

Model or implementation: Based on activation strength of FFN keys

Key-Value Generator

Compute the keys (input representations) and values (target outputs) for the least-squares objective

Model or implementation: Standard ROME/MEMIT calculation

Orthogonal Solver

Compute the orthogonal matrix R that maps keys to values while minimizing error

Model or implementation: Closed-form SVD solution
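
A minimal sketch of a closed-form SVD solver for the classical orthogonal Procrustes problem, min_R ||R A - B||_F subject to R^T R = I. The function name `solve_orthogonal_procrustes` and the matrices `A`, `B`, `Q` are hypothetical names chosen for this illustration; the released MOSE code may organize the computation differently.

```python
import numpy as np


def solve_orthogonal_procrustes(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Return the orthogonal R minimizing ||R @ A - B||_F.

    A and B have shape (d, n): n column vectors of dimension d.
    Classical result: if B @ A.T = U @ diag(s) @ Vt, then R = U @ Vt.
    """
    M = B @ A.T                   # (d, d) cross-covariance of targets and sources
    U, _, Vt = np.linalg.svd(M)   # singular values are not needed for R
    return U @ Vt


# Sanity check on synthetic data: recover a known orthogonal map.
rng = np.random.default_rng(0)
d, n = 6, 20
A = rng.standard_normal((d, n))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # a random orthogonal matrix
B = Q @ A                                         # targets are exactly Q applied to A
R = solve_orthogonal_procrustes(A, B)

print("R is orthogonal:", np.allclose(R.T @ R, np.eye(d)))
print("R recovers Q:   ", np.allclose(R, Q))
```

In an editing setting, `A` would hold the current layer outputs for the keys and `B` the desired outputs; here both are synthetic so that the recovered `R` can be compared with a known answer.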

Novel Architectural Elements

Multiplicative update mechanism (W' = RW) replacing additive updates
Orthogonal transformation constraint explicitly enforced via Procrustes solution
Neighborhood layer updating (editing target layer + 2 adjacent layers)

Modeling

Base Model: LLaMA3-8B, LLaMA2-13B, Qwen2.5-7B

Training Method: Closed-form analytical update (no gradient descent)

Objective Functions:

Purpose: Minimize prediction error for new fact while preserving old facts.

Formally: min_R ||RW K_0 - W K_0||^2 + ||RW K_E - V_E||^2 subject to R^T R = I
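
The objective above can be folded into a single orthogonal Procrustes problem. The derivation below is a sketch under stated assumptions: the norms are Frobenius norms, the two terms are equally weighted (the paper may weight them, for example through lambda), and K_0, K_E, V_E are read as the preserved keys, the edit keys, and the target values. The stacked matrices A and B are notation introduced here.

```latex
\begin{aligned}
A &= \begin{bmatrix} W K_0 & W K_E \end{bmatrix}, \qquad
B = \begin{bmatrix} W K_0 & V_E \end{bmatrix}, \\
\|R W K_0 - W K_0\|_F^2 + \|R W K_E - V_E\|_F^2
  &= \|R A - B\|_F^2 \\
  &= \|A\|_F^2 + \|B\|_F^2 - 2\,\operatorname{tr}\!\left(R\, A B^\top\right), \\
B A^\top = U \Sigma V^\top \ \text{(SVD)}
  &\;\Longrightarrow\; R^\star = U V^\top .
\end{aligned}
```

Since ||RA||_F = ||A||_F for any orthogonal R, only the trace term depends on R, and it is maximized when U^T R V = I, which gives R* = U V^T.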

Key Hyperparameters:

lambda: Not explicitly reported in the paper
batch_size: 10 (for batch-sequential experiments)
editing_steps: 4000 (single-sequential), 500 (batch-sequential)

Compute: Not reported in the paper

Comparison to Prior Work

vs. ROME/MEMIT: Multiplicative (R*W) vs. Additive (W+ΔW) updates; MOSE preserves stability metrics that explode in baselines
vs. RECT/AlphaEdit: These methods mitigate instability but fail at large scales (4000+ edits); MOSE keeps the norm and condition number invariant for any number of edits because every update is orthogonal
vs. PRUNE (a baseline in the paper): MOSE theoretically guarantees an invariant condition number, whereas PRUNE only constrains it

Limitations

Requires computing SVD which can be computationally intensive for very large matrices
Relies on the quality of key-value pairs generated by previous methods (ROME/MEMIT)
Current evaluation is limited to sequential editing of factual/conceptual knowledge

Reproducibility

Code: https://github.com/famoustourist/MOSE

Code is publicly available. Method relies on standard linear algebra operations (SVD), making it deterministic and reproducible given the same key-value pairs.

📊 Experiments & Results

Evaluation Setup

Sequential and Batch-Sequential editing tasks

Benchmarks:

ZsRE (Factual Knowledge Editing)
CounterFact (Factual Knowledge Editing)
ConceptEdit (Conceptual Knowledge Editing)

Metrics:

Editing Performance (Success Rate)
General Abilities (Downstream task performance: NLI, Summarization, QA, Sentiment)
Matrix Norm / Condition Number (Stability metrics)
Statistical methodology: Not explicitly reported in the paper

Key Results

MOSE significantly outperforms baselines in maintaining editing performance after large-scale sequential editing (4000 steps).

Benchmark	Metric	Baseline	This Paper	Δ
CounterFact (LLaMA3-8B)	Score	0.45	0.98	+0.53
Average across tasks	Improvement %	Not reported in the paper	Not reported in the paper	+12.08
Downstream Tasks	Retention Rate %	100.00	95.73	-4.27

Experiment Figures

Evolution of Matrix Norm and Condition Number over 4000 sequential edits.

Downstream task performance (NLI, Sentiment, etc.) vs. Number of Edits.

Main Takeaways

Additive editing methods cause parameter matrix condition numbers to explode, directly correlating with performance collapse.
MOSE's orthogonal updates keep condition numbers and norms effectively constant, allowing for thousands of edits without degradation.
Multi-layer editing (target + neighbors) is more effective than single-layer editing for robust knowledge injection.
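
The first two takeaways can be illustrated with a toy simulation. This is a sketch under loud assumptions, not a reproduction of the paper's experiments: the "edits" are random rank-one deltas (additive) and random orthogonal matrices (multiplicative) applied to a small random matrix, and all names (`W0`, `W_add`, `W_mul`, `stats`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_edits = 16, 200
W0 = rng.standard_normal((d, d))   # stand-in for a weight matrix

W_add = W0.copy()                  # updated as W + delta
W_mul = W0.copy()                  # updated as R @ W

for _ in range(n_edits):
    # Additive edit: a random rank-one delta (toy stand-in for a real edit).
    u = rng.standard_normal((d, 1))
    v = rng.standard_normal((1, d))
    W_add = W_add + 0.1 * (u @ v)

    # Multiplicative edit: left-multiply by a random orthogonal matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    W_mul = Q @ W_mul


def stats(W: np.ndarray) -> tuple[float, float]:
    """Return (Frobenius norm, 2-norm condition number) of W."""
    return float(np.linalg.norm(W, "fro")), float(np.linalg.cond(W))


norm0, cond0 = stats(W0)
norm_add, cond_add = stats(W_add)
norm_mul, cond_mul = stats(W_mul)

print(f"original       norm={norm0:.4f}  cond={cond0:.4f}")
print(f"additive       norm={norm_add:.4f}  cond={cond_add:.4f}")
print(f"multiplicative norm={norm_mul:.4f}  cond={cond_mul:.4f}")
```

In this toy setting the multiplicative statistics match the originals up to rounding, while the additive ones drift away from them. How quickly real additive editors degrade depends on the actual deltas they produce, which this sketch does not model.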

📚 Prerequisite Knowledge

Prerequisites

Linear Algebra (Matrix norms, Condition numbers, SVD)
Transformer architecture (FFN layers)
Knowledge Editing basics (ROME/MEMIT frameworks)

Key Terms

Knowledge Editing: Techniques to precisely modify specific facts in an LLM without re-training the whole model

Orthogonal Matrix: A square matrix R where R^T * R = Identity; multiplying by it rotates or reflects vectors but preserves their lengths and the angles between them

Frobenius Norm: The square root of the sum of a matrix's squared entries, a measure of its total magnitude; preserving this prevents parameters from exploding during updates

Condition Number: The ratio of a matrix's largest to smallest singular value, indicating how strongly small input perturbations can be amplified; high condition numbers signal numerical instability, which the paper links to degraded model performance

Sequential Editing: Performing many editing operations one after another, which typically accumulates errors in traditional methods

Orthogonal Procrustes Problem: A mathematical problem of finding the best orthogonal matrix to map one set of points to another

SVD: Singular Value Decomposition—a factorization method used here to compute the optimal orthogonal update matrix

Additive Editing Paradigm: The standard approach where weights are updated by adding a delta matrix (W_new = W_old + ΔW)

Batch-sequential editing: A more challenging setting where multiple edits are applied in a batch at each step of a sequence