Because we have LLMs, we Can and Should Pursue Agentic Interpretability

📝 Paper Summary

Interpretability, Human-AI Interaction, Human-AI Collaboration

The paper proposes 'agentic interpretability,' where LLMs proactively assist human understanding through multi-turn dialogue by building and leveraging a mental model of the user.

Core Problem

Traditional 'inspective' interpretability treats models as passive objects to be opened up, failing to leverage LLMs' active conversational capabilities to teach humans complex or superhuman concepts.

Why it matters:

Humans risk falling behind in understanding increasingly powerful models if we rely solely on static inspection methods.
Static explanations often fail because they lack a mental model of the user's specific knowledge gaps or context.
Without active guidance, humans may struggle to identify the 'Zone of Proximal Development' where they can learn new machine concepts effectively.

Concrete Example: A user unfamiliar with physics asks about gravitational waves. A standard model might give a generic summary or fall back on technical jargon. An agentic model infers the user's background, identifies where they are confused, and proactively simplifies the explanation or uses analogies suited to that specific user, much as a teacher would.

Key Novelty

Agentic Interpretability via Mutual Mental Modeling

Defines interpretability as a cooperative agentic process where the model actively builds a mental model of the user to tailor its explanations.
Shifts the goal from just 'debugging' to 'teaching,' potentially enabling humans to learn superhuman concepts (e.g., novel chess strategies) via Socratic dialogue.
Introduces the concept of 'human-entangled-in-the-loop' evaluation, where human responses are integral to the algorithm rather than just external feedback.

Evaluation Highlights

No quantitative experiments reported (Position Paper)
Proposes evaluation proxies: 'Case Improve' (can humans improve model performance using the agent?) and 'Case Learn' (can humans predict model behavior on new inputs after interaction?).

Breakthrough Assessment

9/10

A foundational position paper that redefines interpretability for the LLM era. It shifts the paradigm from passive inspection to active, mutual teaching, addressing how humans can keep up with superhuman AI.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn dialogue between a human user and an LLM (the system)

Inputs: User queries, user feedback, and the model's internal states/outputs

Outputs: Explanations, clarifications, or proactive suggestions to aid human understanding of a concept function f

Pipeline Flow

User engages model
Model infers user mental model
Model proactively explains/guides
User updates mental model

System Modules

Mental Model Builder

Infers and maintains a representation of the user's current knowledge and confusion

Model or implementation: LLM (Implicit or Explicit state)

Proactive Explainer

Generates explanations or suggestions based on the user's mental model to maximize understanding

Model or implementation: LLM
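
The pipeline and the two modules above can be pictured as a small stateful loop. The sketch below is only an illustration under stated assumptions: the paper describes both modules as LLM-driven and gives no implementation, so `UserModel`, `update_user_model`, `explain`, and the keyword heuristics are all hypothetical stand-ins chosen to keep the example runnable.

```python
from dataclasses import dataclass, field


@dataclass
class UserModel:
    """Hypothetical explicit state: terms the user seems to know or be confused by."""
    known: set = field(default_factory=set)
    confused: set = field(default_factory=set)


def update_user_model(model: UserModel, utterance: str, vocabulary: set) -> UserModel:
    """Toy stand-in for the Mental Model Builder (an LLM in the paper's framing).

    Assumption: a term mentioned in a turn that signals confusion is marked
    confused; a term mentioned in any other turn is marked known.
    """
    text = utterance.lower()
    signals_confusion = any(
        cue in text for cue in ("what is", "don't understand", "confused")
    )
    for term in vocabulary:
        if term not in text:
            continue
        if signals_confusion:
            model.confused.add(term)
            model.known.discard(term)
        else:
            model.known.add(term)
            model.confused.discard(term)
    return model


def explain(concept: str, prerequisites: list, model: UserModel) -> str:
    """Toy stand-in for the Proactive Explainer: address gaps before the concept."""
    # Anything not marked as known counts as a gap, including confused terms.
    gaps = [p for p in prerequisites if p not in model.known]
    if gaps:
        return f"Before {concept}, let's cover: {', '.join(gaps)}."
    return f"You have the background; here is {concept} directly."


# Two user turns update the state, then the explainer conditions on it.
vocab = {"spacetime", "interferometer"}
state = UserModel()
update_user_model(state, "I have read about spacetime.", vocab)
update_user_model(state, "What is an interferometer?", vocab)
print(explain("gravitational waves", ["spacetime", "interferometer"], state))
# prints: Before gravitational waves, let's cover: interferometer.
```

The point of the sketch is the data flow rather than the heuristics: each user turn updates the estimated user state, and the explanation is conditioned on that state instead of on the query alone.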

Novel Architectural Elements

Integration of a 'user mental model' as a core driver for generation, rather than just optimizing for correctness or relevance
Shift from stateless QA to stateful pedagogical interaction where the 'state' is the estimated user understanding

Comparison to Prior Work

vs. Mechanistic Interpretability: Agentic focuses on interactive, top-down teaching and user modeling rather than just static circuit identification
vs. Saliency Maps: Agentic provides conversational, context-aware explanations rather than static artifacts
vs. CoT: Agentic is multi-turn and proactive, adapting to the user's confusion rather than just revealing a single reasoning path
vs. Interactive ML: Focus is on helping the *human* learn the model, not just the model learning from the human

Limitations

May trade off completeness for interactivity, potentially missing deceptive behaviors in high-stakes settings.
Evaluation is difficult due to 'human-entangled-in-the-loop' nature and high variance in user backgrounds.
If the model is deceptive or non-cooperative, it could use its mental model of the user to mislead them more effectively.
High computational cost for real-time mental modeling and multi-turn dialogue.

Reproducibility

No replication artifacts mentioned in the paper. This is a conceptual position paper without released code or models.

📊 Experiments & Results

Evaluation Setup

Proposed evaluation frameworks (no actual experiments run)

Metrics:

Case Improve: Improvement in model performance after human interaction (proxy for human understanding)
Case Learn: Accuracy of human predictions of model behavior on new inputs (Simulatability)
Statistical methodology: Not explicitly reported in the paper
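
A minimal sketch of how the two proxies might be scored, assuming discrete model outputs and a single scalar performance score. The paper does not specify formulas, so the function names and the exact definitions below (exact-match accuracy, simple score difference) are assumptions made for illustration.

```python
def case_learn_accuracy(human_predictions, model_outputs):
    """Fraction of new inputs on which the human correctly predicted the model's output."""
    if len(human_predictions) != len(model_outputs):
        raise ValueError("predictions and outputs must have the same length")
    if not model_outputs:
        raise ValueError("at least one input is required")
    matches = sum(h == m for h, m in zip(human_predictions, model_outputs))
    return matches / len(model_outputs)


def case_improve_delta(score_before, score_after):
    """Change in model performance after the human intervened with the agent's help."""
    return score_after - score_before


# The human predicts the model's label on four held-out inputs and gets three right.
print(case_learn_accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"]))  # prints 0.75
```

In practice either number would need a baseline, for example the same human's accuracy before the interaction, to attribute any gain to the dialogue itself.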

Main Takeaways

Agentic interpretability is necessary to bridge the gap between human capabilities and increasingly superhuman AI models.
The 'human-entangled-in-the-loop' nature requires new evaluation paradigms, potentially using LLMs as proxies for humans.
Agentic interpretability is less suited to safety-critical auditing of potentially deceptive models, where inspective methods are better, but it is crucial for integration and education.
Future work should explore 'model open surgery' where users converse with a model while intervening in its activations.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs)
Familiarity with traditional interpretability methods (saliency maps, probing)
Concept of 'Mental Models' from cognitive science
Theory of Mind

Key Terms

Agentic Interpretability: A method where an AI proactively assists human understanding in a multi-turn conversation by modeling the user's knowledge state.

Inspective Interpretability: Traditional methods that analyze a model's internals (weights, activations) or outputs statically, without interactive dialogue.

Mental Model: An internal representation of an external reality; here, the model's understanding of what the user knows, and the user's understanding of how the model works.

Zone of Proximal Development (ZPD): A concept from psychology describing tasks a learner can do with guidance but not alone; agentic models aim to target this zone.

Human-entangled-in-the-loop: A state where human responses are not just feedback but an integral, inseparable part of the interpretability algorithm's execution.

Superhuman concepts: Knowledge or patterns discovered by the AI that exceed current human understanding (e.g., novel AlphaZero chess moves).

Mechanistic Interpretability: A bottom-up approach to understanding models by reverse-engineering their internal components (neurons, circuits).

Socratic dialogue: A form of cooperative argumentative dialogue to stimulate critical thinking and draw out ideas and underlying presumptions.

Rational Speech Acts (RSA): A framework modeling communication as recursive reasoning about the listener's interpretation of an utterance.
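
For reference, the recursion behind RSA is usually written as follows. This is the standard formulation from the RSA literature, not notation taken from the paper; the symbols ($w$ for a world state, $u$ for an utterance, $\alpha$ for a rationality parameter, $C(u)$ for utterance cost) are assumptions of this sketch.

```latex
\begin{align}
  L_0(w \mid u) &\propto [\![u]\!](w)\, P(w)
    && \text{literal listener} \\
  S_1(u \mid w) &\propto \exp\!\big(\alpha\,[\log L_0(w \mid u) - C(u)]\big)
    && \text{pragmatic speaker} \\
  L_1(w \mid u) &\propto S_1(u \mid w)\, P(w)
    && \text{pragmatic listener}
\end{align}
```

Here $[\![u]\!](w)$ is 1 if utterance $u$ is literally true of $w$ and 0 otherwise. The speaker picks utterances by reasoning about how a listener would interpret them, which mirrors the explainer reasoning about the user's mental model.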