Entity tracking: The ability of a model to infer and maintain properties associated with an entity (like an object in a box) previously defined in the context.
Path Patching: A technique for identifying important components (here, attention heads) by replacing their activations along a chosen path with those from a corrupted run and measuring the effect on the output logits.
Circuit: A subgraph of the model's components (specifically attention heads) responsible for a specific behavior or task.
DCM: Desiderata-based Component Masking, a method for automatically identifying the model components responsible for specific semantic subtasks by defining desiderata: sets of original and alternate inputs paired with the expected change in output.
CMAP: Cross-Model Activation Patching—a new method introduced in this paper to patch activations from one model (e.g., fine-tuned) into another (e.g., base) to localize performance improvements.
Faithfulness: A metric measuring the percentage of the full model's performance that is recovered by a specific sub-circuit acting in isolation.
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates low-rank decomposition matrices rather than all model weights.
OOD: Out-of-Distribution, referring to data drawn from a distribution that differs from the training distribution.
Q-composition: Information flow in which the output of an earlier head affects the Query vector of a later head.
V-composition: Information flow in which the output of an earlier head affects the Value vector of a later head.
Minimality: A criterion used to prune a circuit, ensuring only heads that significantly contribute to performance are retained.
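
To make the Faithfulness and Minimality entries above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's procedure: the head labels, the additive per-head scores in `TOY_CONTRIBUTIONS`, the `evaluate` helper, the full-model score, and the pruning threshold are all hypothetical. In practice, `evaluate` would run the model with every head outside the circuit ablated.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def faithfulness(circuit_score: float, full_model_score: float) -> float:
    """Fraction of the full model's performance recovered by the circuit alone."""
    if full_model_score == 0:
        raise ValueError("full-model score must be non-zero")
    return circuit_score / full_model_score


def prune_for_minimality(
    circuit: Iterable[Head],
    evaluate: Callable[[FrozenSet[Head]], float],
    threshold: float,
) -> Set[Head]:
    """Greedily drop heads whose removal costs less than `threshold`.

    This is one simple way to apply a minimality criterion; it is an
    assumption of this sketch, not the paper's exact algorithm.
    """
    kept = set(circuit)
    for head in sorted(kept):
        with_head = evaluate(frozenset(kept))
        without_head = evaluate(frozenset(kept - {head}))
        if with_head - without_head < threshold:
            kept.discard(head)
    return kept


# Hypothetical per-head contributions to task accuracy (toy numbers only).
TOY_CONTRIBUTIONS = {(0, 1): 0.40, (1, 3): 0.30, (2, 5): 0.005, (3, 0): 0.10}


def evaluate(heads: FrozenSet[Head]) -> float:
    """Toy stand-in for running the model with only `heads` active."""
    return sum(TOY_CONTRIBUTIONS[h] for h in sorted(heads))


FULL_MODEL_SCORE = 1.0  # hypothetical accuracy of the unablated model

pruned = prune_for_minimality(TOY_CONTRIBUTIONS, evaluate, threshold=0.01)
score = faithfulness(evaluate(frozenset(pruned)), FULL_MODEL_SCORE)
print(sorted(pruned), score)
```

In this toy setup the head with a negligible contribution is pruned, and faithfulness is reported as a fraction of the full-model score; multiply by 100 for a percentage.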