SAE: Sparse Autoencoder—a neural network used to decompose dense model activations into a sparse set of interpretable features (concepts)
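The encode/decode structure of an SAE can be sketched as follows (a minimal numpy sketch with a generic ReLU encoder and linear decoder; the dimensions and random weights are illustrative assumptions, not the document's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 8, 32          # hypothetical sizes: dense activation dim, feature dictionary dim

W_enc = rng.normal(0, 0.1, (d_feat, d_model))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0, 0.1, (d_model, d_feat))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode a dense activation into (mostly zero) feature activations, then reconstruct."""
    z = np.maximum(0.0, W_enc @ x + b_enc)   # ReLU keeps only positive feature activations
    x_hat = W_dec @ z + b_dec                # reconstruction: weighted sum of dictionary vectors
    return z, x_hat

x = rng.normal(size=d_model)
z, x_hat = sae_forward(x)
```

Training would add a reconstruction loss plus a sparsity penalty (or a TopK constraint, as defined below) so that only a few features fire per input.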
DAGMA: DAGs via M-matrices for Acyclicity—a differentiable structure-learning algorithm that optimizes a continuous (log-determinant) acyclicity constraint to learn Directed Acyclic Graphs from data
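The continuous acyclicity constraint can be illustrated with the log-determinant characterization used by DAGMA, which is zero exactly when the weighted graph is acyclic (a minimal numpy sketch; the value s=1.0 and the toy adjacency matrices are illustrative assumptions):

```python
import numpy as np

def h_logdet(W, s=1.0):
    """DAGMA-style acyclicity function: h(W) = -log det(sI - W∘W) + d·log s.
    h(W) = 0 iff W encodes a DAG; h(W) > 0 when the graph contains a cycle."""
    d = W.shape[0]
    M = s * np.eye(d) - W * W                 # Hadamard square removes edge signs
    sign, logabsdet = np.linalg.slogdet(M)    # stable log-determinant
    return -logabsdet + d * np.log(s)

# Strictly upper-triangular matrix = acyclic graph -> h is (numerically) zero
W_dag = np.triu(np.ones((3, 3)), k=1) * 0.5
# Two mutually connected nodes = a 2-cycle -> h is strictly positive
W_cyc = np.array([[0.0, 0.7], [0.7, 0.0]])
```

Because h is differentiable in W, it can be driven to zero with gradient-based optimizers while fitting the data, which is what makes the structure learning "differentiable."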
ROME: Rank-One Model Editing—a method that localizes specific factual associations in a language model and edits them via a rank-one update to a weight matrix
Causal Fidelity Score (CFS): A metric that evaluates whether interventions on graph-identified 'parent' nodes cause larger downstream changes in 'child' nodes than random interventions do
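One way to realize this comparison is as a ratio of intervention effects (a hypothetical toy sketch: the `effect` matrix, the node indices, and the ratio form are all illustrative assumptions, not the document's exact definition):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical effect matrix: entry (i, j) = |change in node j when intervening on node i|
n = 6
effect = np.abs(rng.normal(0.1, 0.05, (n, n)))
effect[0, 3] = 2.0        # assume node 0 is a graph-identified parent of node 3

parent_effect = effect[0, 3]                          # intervene on the claimed parent
random_effects = [effect[i, 3] for i in range(1, n)]  # intervene on other, random nodes
cfs = parent_effect / np.mean(random_effects)         # > 1: parent interventions dominate
```

A score well above 1 would support the graph's parent-child edges; a score near 1 would suggest the edges carry no special causal weight.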
TopK gating: A mechanism that keeps only the K largest activations per input and zeroes the rest, guaranteeing a fixed sparsity level (e.g., 5.1%)
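The gating step itself is simple to sketch (a minimal numpy version; the example vector and K=2 are illustrative assumptions — a 5.1% level would correspond to K/dictionary-size ≈ 0.051):

```python
import numpy as np

def topk_gate(z, k):
    """Keep the k largest activations, zero the rest: exactly k/len(z) sparsity."""
    out = np.zeros_like(z)
    idx = np.argpartition(z, -k)[-k:]    # indices of the k largest entries (O(n))
    out[idx] = z[idx]
    return out

z = np.array([0.1, 3.0, -0.5, 2.0, 0.0, 1.5])
gated = topk_gate(z, 2)   # only the two largest activations survive
```

Unlike an L1 penalty, this enforces the sparsity level exactly rather than encouraging it on average.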
SEM: Structural Equation Model—a statistical model representing causal relationships between variables (here, latent concepts)
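A linear SEM over latent concepts can be sketched by sampling each variable from its parents plus noise (a minimal numpy sketch; the 3-node chain 0 → 1 → 2 and its edge weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical weighted DAG over 3 latent concepts: 0 -> 1 -> 2
W = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, -0.5],
              [0.0, 0.0, 0.0]])

def sample_linear_sem(W, n):
    """Draw n samples from the linear SEM  x_j = sum_i W[i, j] * x_i + eps_j."""
    d = W.shape[0]
    X = np.zeros((n, d))
    for j in range(d):                 # columns visited in the DAG's topological order
        eps = rng.normal(size=n)
        X[:, j] = X @ W[:, j] + eps    # each variable = weighted parents + Gaussian noise
    return X

X = sample_linear_sem(W, 1000)
```

Structure learning (e.g., DAGMA above) works in the reverse direction: given samples X, recover the weight matrix W.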
Residual stream: The primary vector pathway in Transformer models, to which each layer's attention and MLP blocks add their outputs and from which subsequent layers read
Bonferroni correction: A statistical adjustment for multiple hypothesis testing that divides the significance threshold (equivalently, multiplies each p-value) by the number of tests, controlling the family-wise error rate and reducing false positives
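The correction amounts to a one-line threshold change (a minimal sketch; the example p-values and α = 0.05 are illustrative assumptions):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i iff p_i <= alpha / m, where m is the number of tests.
    Equivalently, compare the adjusted p-value min(m * p_i, 1) against alpha."""
    m = len(p_values)
    threshold = alpha / m
    adjusted = [min(m * p, 1.0) for p in p_values]
    rejected = [p <= threshold for p in p_values]
    return adjusted, rejected

# With 4 tests, the per-test threshold drops from 0.05 to 0.0125
adjusted, rejected = bonferroni([0.001, 0.02, 0.3, 0.04])
```

Note that 0.02 and 0.04 would pass an uncorrected 0.05 threshold but fail after correction, which is exactly the false-positive control the correction provides.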