Agentic Interpretability: A method where an AI proactively assists human understanding in a multi-turn conversation by modeling the user's knowledge state.
Inspective Interpretability: Traditional methods that analyze a model's internals (weights, activations) or outputs statically, without interactive dialogue.
Mental Model: An internal representation of an external reality; here, the model's understanding of what the user knows, and the user's understanding of how the model works.
Zone of Proximal Development (ZPD): A concept from developmental psychology (due to Vygotsky) describing the range of tasks a learner can do with guidance but not yet alone; agentic models aim to pitch their explanations within this zone.
Human-entangled-in-the-loop: A state where human responses are not just feedback but an integral, inseparable part of the interpretability algorithm's execution.
Superhuman concepts: Knowledge or patterns discovered by the AI that exceed current human understanding (e.g., novel AlphaZero chess moves).
Mechanistic Interpretability: A bottom-up approach to understanding models by reverse-engineering their internal components (neurons, circuits).
Socratic dialogue: A form of cooperative argumentative dialogue, based on asking and answering questions, that stimulates critical thinking and draws out ideas and underlying presumptions.
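Rational Speech Acts (RSA): A probabilistic framework modeling communication as recursive reasoning: a speaker chooses an utterance by reasoning about how a listener would interpret it, and a listener interprets the utterance by reasoning about why the speaker chose it.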
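
The recursive reasoning in the RSA entry can be illustrated with a minimal sketch of the standard three-level formulation (literal listener, pragmatic speaker, pragmatic listener). The toy "some" vs. "all" scenario, the uniform prior, the rationality parameter `alpha`, and all function and variable names are assumptions chosen for illustration; none of them come from the source.

```python
from typing import Dict

# Hypothetical toy domain (an assumption, not from the source):
# two world states and two utterances with literal truth conditions.
STATES = ["some_not_all", "all"]
UTTERANCES = ["some", "all"]
MEANING = {
    "some": {"some_not_all": True, "all": True},   # "some" is literally true in both states
    "all": {"some_not_all": False, "all": True},   # "all" is true only in the "all" state
}
PRIOR = {"some_not_all": 0.5, "all": 0.5}  # uniform prior over states (assumed)


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative weights so they sum to 1 (all zeros stay zeros)."""
    total = sum(weights.values())
    if total == 0:
        return {k: 0.0 for k in weights}
    return {k: v / total for k, v in weights.items()}


def literal_listener(utterance: str) -> Dict[str, float]:
    """L0(state | utterance): prior restricted to states where the utterance is true."""
    return normalize(
        {s: PRIOR[s] * float(MEANING[utterance][s]) for s in STATES}
    )


def pragmatic_speaker(state: str, alpha: float = 1.0) -> Dict[str, float]:
    """S1(utterance | state): prefers utterances that lead L0 to the intended state."""
    return normalize(
        {u: literal_listener(u)[state] ** alpha for u in UTTERANCES}
    )


def pragmatic_listener(utterance: str, alpha: float = 1.0) -> Dict[str, float]:
    """L1(state | utterance): Bayesian inversion of the pragmatic speaker."""
    return normalize(
        {s: PRIOR[s] * pragmatic_speaker(s, alpha)[utterance] for s in STATES}
    )


if __name__ == "__main__":
    # The literal listener is indifferent after hearing "some" (0.5 / 0.5);
    # the pragmatic listener shifts toward "some but not all" (about 0.75 / 0.25).
    print(literal_listener("some"))
    print(pragmatic_listener("some"))
```

In this sketch, one round of recursion is enough for the listener to infer that a speaker who said "some" probably did not mean "all", since a speaker in the "all" state would more likely have said "all". Raising `alpha` makes the assumed speaker more strictly rational and strengthens that inference.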