
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models

D Arad, Y Belinkov, H Chen, N Kim, H Mohebbi…
Technion – Israel Institute of Technology, Rice University, Boston University, Tilburg University, University of Groningen, University of Zagreb, Harvard University
Proceedings of the 8th …, 2025
Benchmark Reasoning QA

📝 Paper Summary

Mechanistic Interpretability (MI) · Circuit Discovery · Causal Variable Localization
This shared task evaluates mechanistic interpretability methods by measuring how well they recover causal circuits and interpretable features across four language models and five reasoning tasks.
Core Problem
Systematically comparing mechanistic interpretability methods is difficult due to the lack of standardized frameworks for evaluating how well techniques identify circuits or localize causal variables.
Why it matters:
  • Without standardized benchmarks, it is difficult to determine which interpretability methods faithfully explain model behavior and which provide illusory explanations.
  • Reproducibility in interpretability research is often low, hindering progress in understanding how large language models implement specific behaviors or reasoning steps.
Concrete Example: In the IOI (Indirect Object Identification) task, a model must complete a prompt like 'When John and Mary went to the store, John gave an apple to _' with 'Mary'. A circuit discovery method might identify a subgraph responsible for this behavior, but without a benchmark like MIB, we cannot rigorously quantify whether that subgraph is minimal and sufficient compared to subgraphs found by other methods.
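The counterfactual intervention behind such an evaluation can be sketched in a few lines. This is a toy illustration of activation patching, not the shared task's actual pipeline: the two-layer "model", its random weights, and the token ids are all invented for illustration, and the patched hidden state stands in for a real attention-head or MLP activation.

```python
import numpy as np

def run_model(tokens, patch=None):
    """Toy 2-layer 'model': embed -> hidden -> logits.

    `patch` optionally replaces the hidden activation with one cached
    from another (counterfactual) input, as in activation patching.
    Weights are fixed toy values, not a trained language model.
    """
    rng = np.random.default_rng(0)           # fixed seed -> fixed weights
    W_embed = rng.normal(size=(10, 4))       # vocab of 10, width 4
    W_hidden = rng.normal(size=(4, 4))
    W_out = rng.normal(size=(4, 10))

    x = W_embed[tokens].mean(axis=0)         # crude bag-of-tokens embedding
    hidden = np.tanh(x @ W_hidden)           # the site we intervene on
    if patch is not None:
        hidden = patch                       # counterfactual intervention
    logits = hidden @ W_out
    return hidden, logits

clean_tokens = [1, 2, 3]
corrupt_tokens = [1, 2, 7]                   # counterfactual prompt

clean_hidden, clean_logits = run_model(clean_tokens)
corrupt_hidden, _ = run_model(corrupt_tokens)

# Patch the corrupt hidden state into the clean run and measure how much
# a chosen "answer" logit moves: a large shift means this site matters.
_, patched_logits = run_model(clean_tokens, patch=corrupt_hidden)
effect = patched_logits[3] - clean_logits[3]
print(f"effect of patching hidden layer: {effect:.3f}")
```

In a real evaluation the intervention targets specific heads and positions, and faithfulness aggregates such effects over a dataset of clean/counterfactual prompt pairs.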
Key Novelty
BlackboxNLP 2025 Shared Task on Mechanistic Interpretability
  • Establishes a standardized competition based on the MIB (Mechanistic Interpretability Benchmark) to evaluate circuit discovery and causal variable localization methods on hidden test sets.
  • Introduces a rigorous evaluation pipeline using counterfactual interventions (activation patching) to measure the faithfulness of discovered circuits and featurized variables.
  • Reveals that ensembling attribution methods and using non-linear projections for features significantly outperform standard baselines.
Evaluation Highlights
  • Hybrid ensembling of attribution methods achieves the highest CPR (Circuit Performance Ratio) across multiple tasks, outperforming individual edge patching methods.
  • Non-linear featurizers (using MLPs) achieve near-perfect faithfulness (0.99-1.0) on Arithmetic tasks with Llama-3.1-8B, significantly outperforming linear DAS baselines.
  • Bootstrapping to filter unstable edges improves CMD (Circuit-Model Distance), finding circuits that more closely match the full model's preference strength.
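The first highlight, ensembling attribution methods, can be illustrated with a minimal sketch. The method names, edge labels, and scores below are hypothetical; the point is the mechanics: rank-normalize each method's per-edge scores so that different scales become comparable, average them, and keep the top-k edges as the candidate circuit.

```python
import numpy as np

# Hypothetical per-edge importance scores from three attribution methods;
# the edges, method names, and values are illustrative, not from the paper.
edges = ["h0->h3", "h1->h3", "h2->mlp1", "mlp1->out", "h3->out"]
scores = {
    "attr_patch": np.array([0.90, 0.05, 0.40, 0.80, 0.10]),
    "int_grads":  np.array([0.70, 0.10, 0.60, 0.95, 0.05]),
    "eap":        np.array([0.85, 0.02, 0.30, 0.70, 0.20]),
}

def rank_normalize(s):
    """Map raw scores to [0, 1] by rank, so methods with different
    scales contribute equally to the ensemble."""
    ranks = s.argsort().argsort()            # rank of each edge (0 = lowest)
    return ranks / (len(s) - 1)

# Ensemble: average the rank-normalized scores across methods.
ensemble = np.mean([rank_normalize(s) for s in scores.values()], axis=0)

# Keep the top-k edges as the candidate circuit.
k = 3
circuit = [edges[i] for i in np.argsort(-ensemble)[:k]]
print("ensembled circuit:", circuit)
```

Rank averaging is one simple hybrid scheme; the shared-task entries may combine methods differently (e.g. score averaging or voting), but the principle of pooling several noisy attribution signals is the same.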
Breakthrough Assessment
7/10
Provides a crucial standardization step for the field of interpretability. While it does not propose a single new state-of-the-art method, the comparative insights (ensembling, non-linearity) are valuable for future methodology.