Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking

📝 Paper Summary

Mechanistic Interpretability, Fine-tuning dynamics

Fine-tuning improves entity tracking not by creating new circuits, but by enhancing the existing circuit's ability to handle positional information and attend to correct objects.

Core Problem

Fine-tuning on general tasks (such as math or code) improves specific capabilities like entity tracking, but it is unknown how this process alters the model's internal computational mechanisms.

Why it matters:

Understanding whether fine-tuning fundamentally rewires a model or merely amplifies existing capabilities is crucial for safety and controllability.
Entity tracking is a fundamental component of complex reasoning, yet the mechanistic effect of fine-tuning on this specific capability remains elusive.
Prior work suggests fine-tuning might destroy OOD knowledge or shift weights, but lacks precise circuit-level explanations for performance gains.

Concrete Example: In the task 'The apple is in box F... Box F contains the', a base model might fail to link 'Box F' to 'apple'. A math-tuned model succeeds. Does it use a new 'reasoning circuit' or just run the old one better?

Key Novelty

Mechanism Enhancement via Positional Augmentation

Demonstrates that base and fine-tuned models use the *same* sparse attention circuit for entity tracking, rather than developing new pathways.
Identifies that the performance boost comes from improved positional information handling: the circuit better tracks *where* the relevant entity is, rather than changing *how* it extracts the information.
Introduces Cross-Model Activation Patching (CMAP) to transplant activations between models, showing that activations from the fine-tuned models can improve base-model performance when grafted in.

Architecture

The Entity Tracking Circuit discovered in Llama-7B, showing four groups of attention heads (A, B, C, D) and their connectivity.

Evaluation Highlights

The sparse entity-tracking circuit discovered in Llama-7B (72 heads) recovers 100% of the base model's performance (faithfulness score 1.0).
The *same* 72-head circuit from the base model recovers 97% of the performance in Vicuna-7B and ~88-89% in arithmetic-tuned models (Goat-7B/FLoat-7B).
Fine-tuned models (Goat-7B, FLoat-7B) achieve near-perfect accuracy (>97%) on entity tracking compared to Llama-7B's 54%.

Breakthrough Assessment

8/10

Strong mechanistic evidence that fine-tuning preserves and amplifies existing circuits rather than creating new ones. Introduces novel patching methods (CMAP) and provides a clear case study on entity tracking.

⚙️ Technical Details

Problem Definition

Setting: Entity tracking in text: identifying the content of a specific container defined earlier in the context.

Inputs: A sequence describing object locations (e.g., 'The apple is in box F...') followed by a query (e.g., 'Box F contains the').

Outputs: The token corresponding to the object in the queried container (e.g., 'apple').

Pipeline Flow

Circuit Discovery (Path Patching on Llama-7B)
Circuit Pruning (Minimality analysis)
Circuit Evaluation (Faithfulness on Base & Fine-tuned models)
Mechanism Analysis (DCM & CMAP)

System Modules

Circuit Discovery (Analysis)

Identify attention heads involved in entity tracking

Model or implementation: Llama-7B

Circuit Pruning (Analysis)

Remove redundant heads to find minimal circuit

Model or implementation: Llama-7B
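
The pruning step can be pictured with a minimal greedy sketch. This is an illustration under stated assumptions, not the authors' procedure: the head names, the per-head contributions, the additive `evaluate` function, and the `tolerance` threshold are all hypothetical.

```python
def greedy_prune(heads, evaluate, tolerance=0.01):
    """Drop heads one at a time when removing them costs less than `tolerance`.

    `evaluate` maps a list of heads to a task score. Heads are visited once,
    in the order given, so interactions between heads can be missed.
    """
    kept = list(heads)
    for head in list(heads):
        candidate = [h for h in kept if h != head]
        if evaluate(kept) - evaluate(candidate) < tolerance:
            kept = candidate  # removal was nearly free, so discard the head
    return kept


# Hypothetical per-head contributions to task accuracy (made-up numbers).
CONTRIBUTION = {"L21H3": 0.30, "L18H8": 0.20, "L12H1": 0.005, "L9H4": 0.0}


def toy_evaluate(heads):
    """Toy score: heads contribute independently and additively."""
    return sum(CONTRIBUTION[h] for h in heads)


minimal = greedy_prune(list(CONTRIBUTION), toy_evaluate)
print(minimal)  # ['L21H3', 'L18H8']
```

Because each head is tested on its own against the current set, a pair of heads that only matters jointly could be handled poorly by this kind of loop, which is the concern raised under Limitations.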

Mechanism Analysis (Analysis)

Determine semantic roles of head groups and localize improvements

Model or implementation: All models (Base + Fine-tuned)

Novel Architectural Elements

Cross-Model Activation Patching (CMAP): A methodology to patch activations across different model versions (base vs. fine-tuned) to pinpoint mechanistic improvements.
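
The idea behind CMAP can be sketched on a toy residual-stream model. Everything below is an assumption made for illustration: the two "models" are ordered lists of named scalar functions rather than transformers, and the names `run`, `cmap`, `head_a`, and `head_b` are hypothetical. The sketch only shows the patching logic: cache a component's output in the donor model, then substitute it for the same component in the recipient's forward pass.

```python
def run(model, x, patch=None):
    """Run a toy residual model; `patch` overrides selected component outputs."""
    patch = patch or {}
    resid, cache = x, {}
    for name, fn in model:
        # Use the patched activation if one is supplied, else compute normally.
        out = patch[name] if name in patch else fn(resid)
        cache[name] = out
        resid = resid + out  # each component writes into the residual stream
    return resid, cache


def cmap(recipient, donor, x, components):
    """Patch the donor's activations for `components` into the recipient."""
    _, donor_cache = run(donor, x)
    output, _ = run(recipient, x, patch={c: donor_cache[c] for c in components})
    return output


# Two toy models with the same component layout; only head_b differs.
base = [("head_a", lambda r: 1.0 * r), ("head_b", lambda r: 0.5 * r)]
tuned = [("head_a", lambda r: 1.0 * r), ("head_b", lambda r: 2.0 * r)]

print(run(base, 1.0)[0])                    # 3.0
print(run(tuned, 1.0)[0])                   # 6.0
print(cmap(base, tuned, 1.0, ["head_b"]))   # 6.0: patching head_b closes the gap
print(cmap(base, tuned, 1.0, ["head_a"]))   # 3.0: patching head_a changes nothing
```

In this toy setting, patching `head_b` alone reproduces the tuned model's output, which localizes the difference between the two models to that component. A real application would patch attention-head activations at specific token positions instead of scalars.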

Modeling

Base Model: Llama-7B

Training Method: Analysis of existing fine-tuned models (Vicuna, Goat) and one custom fine-tuned model (FLoat)

Adaptation: LoRA (for Goat-7B), Full fine-tuning (for Vicuna-7B and FLoat-7B)

Trainable Parameters: Varies by model (LoRA parameters vs full weights)

Training Data:

Vicuna: ShareGPT conversations
Goat/FLoat: Synthetic arithmetic expressions

Compute: Not reported in the paper

Comparison to Prior Work

vs. Tracr: Analyzes real-world LLMs (Llama-7B) rather than constructed toy models [not cited in paper]
vs. Jain et al. (2023): Both find fine-tuning preserves capabilities, but this paper focuses on real LLMs (Llama) rather than Tracr models.

Limitations

Analysis is limited to a specific synthetic entity tracking task; generalization to other tasks is unproven.
Focuses primarily on attention heads, considering MLPs as generic computation blocks without detailed analysis.
The minimal circuit definition relies on a greedy pruning approach which might miss synergistic effects.

Reproducibility

Code: https://finetuning.baulab.info

Code, data, and the FLoat-7B model are available at https://finetuning.baulab.info. The paper analyzes open models (Llama-7B, Vicuna-7B, Goat-7B).

📊 Experiments & Results

Evaluation Setup

Synthetic entity tracking task: 7 boxes with random labels and single-token objects.
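
A sample in this style could be generated roughly as follows. This is a hedged sketch: the object vocabulary, the label alphabet, the sentence template, and the function name `make_example` are assumptions, and the paper's actual data construction may differ.

```python
import random

# Hypothetical vocabularies; the paper's actual object and label lists may differ.
OBJECTS = ["apple", "book", "clock", "drum", "fan", "key", "map", "pen", "ring", "sock"]
LABELS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def make_example(num_boxes=7, rng=None):
    """Return (prompt, answer) for one synthetic entity-tracking sample."""
    rng = rng or random.Random()
    objects = rng.sample(OBJECTS, num_boxes)  # distinct objects
    labels = rng.sample(LABELS, num_boxes)    # distinct random box labels
    facts = [f"The {obj} is in Box {lab}" for obj, lab in zip(objects, labels)]
    query = rng.randrange(num_boxes)          # which box the prompt asks about
    prompt = ". ".join(facts) + f". Box {labels[query]} contains the"
    return prompt, objects[query]


prompt, answer = make_example(rng=random.Random(0))
print(prompt)
print("expected next token:", answer)
```

Sampling labels and objects without replacement keeps each query unambiguous: exactly one object is associated with the queried box.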

Benchmarks:

Synthetic Entity Tracking (State tracking / Retrieval) [New]

Metrics:

Accuracy
Faithfulness (Circuit Accuracy / Model Accuracy)
Completeness
Statistical methodology: Not explicitly reported in the paper
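
The faithfulness ratio listed under Metrics can be written out as a short sketch. The function names and the toy predictions are illustrative assumptions, not code from the paper.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that exactly match their targets."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and aligned")
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def faithfulness(circuit_accuracy, model_accuracy):
    """Circuit accuracy divided by full-model accuracy."""
    if model_accuracy <= 0:
        raise ValueError("model accuracy must be positive")
    return circuit_accuracy / model_accuracy


# Toy data: the full model gets 4 of 4 right, the isolated circuit 3 of 4.
targets = ["apple", "key", "map", "pen"]
model_preds = ["apple", "key", "map", "pen"]
circuit_preds = ["apple", "key", "sock", "pen"]

score = faithfulness(accuracy(circuit_preds, targets), accuracy(model_preds, targets))
print(score)  # 0.75
```

A score of 1.0 means the circuit alone matches the full model's accuracy; it says nothing about whether that accuracy is itself high.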

Key Results

Performance baselines show that fine-tuned models significantly outperform the base Llama-7B model on the entity tracking task.

Benchmark	Metric	Baseline	This Paper	Δ
Synthetic Entity Tracking	Accuracy	0.54	0.99	+0.45

Faithfulness results demonstrate that the circuit discovered in the base model accounts for the vast majority of performance in the fine-tuned models.

Benchmark	Metric	Baseline	This Paper	Δ
Synthetic Entity Tracking	Faithfulness	0.01	0.89	+0.88
Synthetic Entity Tracking	Faithfulness	1.0	0.88	-0.12

Main Takeaways

Fine-tuning preserves the mechanistic operation of the model: the same circuit topology (heads at specific positions) implements entity tracking in both base and fine-tuned versions.
The mechanism functions by tracking the *position* of the correct entity (positional transport) rather than transporting the entity's name or property directly.
Performance improvements are primarily driven by enhanced handling of positional information within the existing circuit, specifically in how the circuit attends to the correct object token.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention heads, Residual streams)
Mechanistic Interpretability (Path Patching, Circuits)
Fine-tuning techniques (LoRA, Full fine-tuning)

Key Terms

Entity tracking: The ability of a model to infer and maintain properties associated with an entity (like an object in a box) previously defined in the context.

Path Patching: A technique to identify important components (heads) by replacing their activations along specific computational paths with those from a corrupted run and measuring the impact on the output logits.

Circuit: A subgraph of the model's components (specifically attention heads) responsible for a specific behavior or task.

DCM: Desiderata-based Component Masking—a method for automatically identifying model components responsible for specific semantic subtasks by defining task alternations.

CMAP: Cross-Model Activation Patching—a new method introduced in this paper to patch activations from one model (e.g., fine-tuned) into another (e.g., base) to localize performance improvements.

Faithfulness: A metric measuring the percentage of the full model's performance that is recovered by a specific sub-circuit acting in isolation.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates low-rank decomposition matrices rather than all model weights.
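
To make the parameter saving concrete, the sketch below counts trainable parameters for one weight matrix under full fine-tuning versus a rank-r LoRA update. The rank of 8 is a hypothetical choice, and the 4096 x 4096 shape is assumed as a typical projection size for a 7B Llama-style model; neither is taken from the Goat-7B configuration.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-`rank` update B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


d = 4096   # assumed projection size
r = 8      # hypothetical LoRA rank

print(full_params(d, d))                       # 16777216
print(lora_params(d, d, r))                    # 65536
print(lora_params(d, d, r) / full_params(d, d))  # 0.00390625
```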

OOD: Out-of-Distribution—data that is different from the training data distribution.

Q-composition: Information flow where a previous head affects the Query vector of a subsequent head.

V-composition: Information flow where a previous head affects the Value vector of a subsequent head.

Minimality: A criterion used to prune a circuit, ensuring only heads that significantly contribute to performance are retained.