
Because we have LLMs, we Can and Should Pursue Agentic Interpretability

Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord
Google DeepMind
arXiv.org (2025)
Agent Reasoning Benchmark

📝 Paper Summary

Interpretability, Human-AI Interaction, Human-AI Collaboration
The paper proposes 'agentic interpretability,' where LLMs proactively assist human understanding through multi-turn dialogue by building and leveraging a mental model of the user.
Core Problem
Traditional 'inspective' interpretability treats models as passive objects to be opened up. It does not use LLMs' conversational abilities to actively teach humans complex or superhuman concepts.
Why it matters:
  • Humans risk falling behind in understanding increasingly powerful models if we rely solely on static inspection methods.
  • Static explanations often fail because they lack a mental model of the user's specific knowledge gaps or context.
  • Without active guidance, humans may struggle to identify the 'Zone of Proximal Development' where they can learn new machine concepts effectively.
Concrete Example: A user unfamiliar with physics asks about gravitational waves. A standard model might respond with a generic summary or with dense technical jargon. An agentic model infers the user's background, identifies points of confusion, and proactively simplifies the explanation or uses analogies suited to that user, much as a teacher would.
Key Novelty
Agentic Interpretability via Mutual Mental Modeling
  • Defines interpretability as a cooperative agentic process where the model actively builds a mental model of the user to tailor its explanations.
  • Shifts the goal from just 'debugging' to 'teaching,' potentially enabling humans to learn superhuman concepts (e.g., novel chess strategies) via Socratic dialogue.
  • Introduces the concept of 'human-entangled-in-the-loop' evaluation, where human responses are integral to the algorithm rather than just external feedback.
Evaluation Highlights
  • No quantitative experiments reported (Position Paper)
  • Proposes evaluation proxies: 'Case Improve' (can humans improve model performance using the agent?) and 'Case Learn' (can humans predict model behavior on new inputs after interaction?).
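The two proxies above could be scored in many ways, and the paper reports no experiments. The following is only a minimal sketch of one possible scoring scheme. It assumes 'Case Learn' is measured as the change in how often a human correctly predicts the model's output on held-out inputs, and 'Case Improve' as the change in a task score after the human intervenes. All function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def prediction_accuracy(human_predictions: Sequence[str],
                        model_outputs: Sequence[str]) -> float:
    """Fraction of held-out inputs where the human guessed the model's output."""
    if len(human_predictions) != len(model_outputs):
        raise ValueError("predictions and outputs must have the same length")
    if not model_outputs:
        raise ValueError("at least one held-out input is required")
    matches = sum(h == m for h, m in zip(human_predictions, model_outputs))
    return matches / len(model_outputs)


def case_learn_gain(pre_predictions: Sequence[str],
                    post_predictions: Sequence[str],
                    model_outputs: Sequence[str]) -> float:
    """Assumed 'Case Learn' proxy: accuracy after the dialogue minus before."""
    before = prediction_accuracy(pre_predictions, model_outputs)
    after = prediction_accuracy(post_predictions, model_outputs)
    return after - before


def case_improve_gain(score_before: float, score_after: float) -> float:
    """Assumed 'Case Improve' proxy: task score after the human's changes
    (made with the agent's help) minus the score before."""
    return score_after - score_before


if __name__ == "__main__":
    # Toy data (hypothetical): what the model actually output on four new inputs.
    model_outputs = ["A", "B", "A", "C"]
    # Human guesses before and after interacting with the agent.
    pre = ["A", "A", "B", "C"]
    post = ["A", "B", "A", "A"]
    print("Case Learn gain:", case_learn_gain(pre, post, model_outputs))
    print("Case Improve gain:", case_improve_gain(0.5, 0.75))
```

A positive gain in either function would suggest that the interaction helped, though a real study would also need a control condition and enough participants to separate the agent's effect from ordinary practice effects.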
Breakthrough Assessment
9/10
A foundational position paper that redefines interpretability for the LLM era. It shifts the paradigm from passive inspection to active, mutual teaching, addressing how humans can keep up with superhuman AI.