Model Editing: Techniques to precisely modify specific knowledge or behaviors in an LLM's weights without full retraining
ROME: Rank-One Model Editing—a method that treats MLP layers as key-value stores and uses rank-one updates to modify specific factual associations
MEMIT: Mass-Editing Memory in a Transformer—a successor to ROME that allows editing thousands of facts simultaneously
FT-L: Constrained Fine-Tuning—fine-tuning that targets specific layers identified by causal tracing with norm constraints to minimize collateral damage
FT-M: Fine-Tuning with Masking—fine-tuning that optimizes the target answer while masking the original text to focus updates
UPQA: User Preference Question Answering—a new benchmark introduced in this paper to test recall of user attributes via explicit and implicit questions
PREFEVAL: A multi-turn conversation benchmark for evaluating preference following in LLMs
Implicit Query: A question that requires reasoning about a user's preference without explicitly stating the attribute (e.g., 'What should I do?' implying a hobby preference)
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning method that injects trainable low-rank matrices into transformer layers
Acknowledgment Rate: The percentage of model responses that demonstrate awareness of the user's specific preference