
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models

D Arad, Y Belinkov, H Chen, N Kim, H Mohebbi…
Technion – Israel Institute of Technology, Rice University, Boston University, Tilburg University, University of Groningen, University of Zagreb, Harvard University
Proceedings of the 8th …, 2025
Benchmark Reasoning QA

📝 Paper Summary

Mechanistic Interpretability (MI) · Circuit Discovery · Causal Variable Localization
This shared task evaluates mechanistic interpretability methods by measuring how well they recover causal circuits and interpretable features across four language models and five reasoning tasks.
Core Problem
Systematically comparing mechanistic interpretability methods is difficult due to the lack of standardized frameworks for evaluating how well techniques identify circuits or localize causal variables.
Why it matters:
  • Without standardized benchmarks, it is difficult to determine which interpretability methods faithfully explain model behavior and which provide illusory explanations.
  • Reproducibility in interpretability research is often low, hindering progress in understanding how large language models implement specific behaviors or reasoning steps.
Concrete Example: In the IOI (Indirect Object Identification) task, a model must complete a prompt like 'When John and Mary went to the store, John gave an apple to _' with 'Mary'. A circuit discovery method might identify a subgraph responsible for this behavior, but without a benchmark like MIB, we cannot rigorously quantify whether that subgraph is minimal and sufficient compared to subgraphs found by other methods.
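The counterfactual intervention behind such an evaluation can be sketched in a few lines. This is a toy illustration of activation patching, not the shared task's actual pipeline: the two-layer "model", its random weights, and the token ids are all invented for illustration, and the patched hidden state stands in for a real attention-head or MLP activation.

```python
import numpy as np

def run_model(tokens, patch=None):
    """Toy 2-layer 'model': embed -> hidden -> logits.

    `patch` optionally replaces the hidden activation with one cached
    from another (counterfactual) input, as in activation patching.
    Weights are fixed toy values, not a trained language model.
    """
    rng = np.random.default_rng(0)           # fixed seed -> fixed weights
    W_embed = rng.normal(size=(10, 4))       # vocab of 10, width 4
    W_hidden = rng.normal(size=(4, 4))
    W_out = rng.normal(size=(4, 10))

    x = W_embed[tokens].mean(axis=0)         # crude bag-of-tokens embedding
    hidden = np.tanh(x @ W_hidden)           # the site we intervene on
    if patch is not None:
        hidden = patch                       # counterfactual intervention
    logits = hidden @ W_out
    return hidden, logits

clean_tokens = [1, 2, 3]
corrupt_tokens = [1, 2, 7]                   # counterfactual prompt

clean_hidden, clean_logits = run_model(clean_tokens)
corrupt_hidden, _ = run_model(corrupt_tokens)

# Patch the corrupt hidden state into the clean run and measure how much
# a chosen "answer" logit moves: a large shift means this site matters.
_, patched_logits = run_model(clean_tokens, patch=corrupt_hidden)
effect = patched_logits[3] - clean_logits[3]
print(f"effect of patching hidden layer: {effect:.3f}")
```

In a real evaluation the intervention targets specific heads and positions, and faithfulness aggregates such effects over a dataset of clean/counterfactual prompt pairs.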
Key Novelty
BlackboxNLP 2025 Shared Task on Mechanistic Interpretability
  • Establishes a standardized competition based on the MIB (Mechanistic Interpretability Benchmark) to evaluate circuit discovery and causal variable localization methods on hidden test sets.
  • Introduces a rigorous evaluation pipeline using counterfactual interventions (activation patching) to measure the faithfulness of discovered circuits and featurized variables.
  • Reveals that ensembling attribution methods and using non-linear projections for features significantly outperform standard baselines.
Evaluation Highlights
  • Hybrid ensembling of attribution methods achieves the highest CPR (Circuit Performance Ratio) across multiple tasks, outperforming individual edge patching methods.
  • Non-linear featurizers (using MLPs) achieve near-perfect faithfulness (0.99-1.0) on Arithmetic tasks with Llama-3.1-8B, significantly outperforming linear DAS baselines.
  • Bootstrapping to filter unstable edges improves CMD (Circuit-Model Distance), finding circuits that more closely match the full model's preference strength.
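The first highlight, ensembling attribution methods, can be illustrated with a minimal sketch. The method names, edge labels, and scores below are hypothetical; the point is the mechanics: rank-normalize each method's per-edge scores so that different scales become comparable, average them, and keep the top-k edges as the candidate circuit.

```python
import numpy as np

# Hypothetical per-edge importance scores from three attribution methods;
# the edges, method names, and values are illustrative, not from the paper.
edges = ["h0->h3", "h1->h3", "h2->mlp1", "mlp1->out", "h3->out"]
scores = {
    "attr_patch": np.array([0.90, 0.05, 0.40, 0.80, 0.10]),
    "int_grads":  np.array([0.70, 0.10, 0.60, 0.95, 0.05]),
    "eap":        np.array([0.85, 0.02, 0.30, 0.70, 0.20]),
}

def rank_normalize(s):
    """Map raw scores to [0, 1] by rank, so methods with different
    scales contribute equally to the ensemble."""
    ranks = s.argsort().argsort()            # rank of each edge (0 = lowest)
    return ranks / (len(s) - 1)

# Ensemble: average the rank-normalized scores across methods.
ensemble = np.mean([rank_normalize(s) for s in scores.values()], axis=0)

# Keep the top-k edges as the candidate circuit.
k = 3
circuit = [edges[i] for i in np.argsort(-ensemble)[:k]]
print("ensembled circuit:", circuit)
```

Rank averaging is one simple hybrid scheme; the shared-task entries may combine methods differently (e.g. score averaging or voting), but the principle of pooling several noisy attribution signals is the same.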
Breakthrough Assessment
7/10
Provides a crucial standardization step for the field of interpretability. While it does not propose a single new state-of-the-art method, the comparative insights (ensembling, non-linearity) are valuable for future methodology.