
Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Md Muntaqim Meherab, Noor Islam S. Mohammad, Faiza Feroz
arXiv (2026)
Reasoning Benchmark

📝 Paper Summary

Mechanistic Interpretability · Sparse Autoencoders · Causal Abstraction
Causal Concept Graphs combine task-conditioned sparse autoencoders with differentiable structure learning to uncover and validate the causal dependencies between latent concepts in language models without manual annotation.
Core Problem
Mechanistic interpretability tools like Sparse Autoencoders localize features but fail to capture the dynamic, multi-step causal interactions between them as reasoning unfolds.
Why it matters:
  • Without tracing internal reasoning steps, we cannot robustly diagnose failures or distinguish genuine reasoning from shortcut strategies in safety-critical systems
  • Existing model editing methods (e.g., ROME) target single factual associations but are not designed for distributed, compositional reasoning chains
Concrete Example: Current methods might localize the concept of 'France' or 'Paris', but cannot automatically recover the causal chain where activating 'Paris' causally precedes and triggers the activation of 'France' during a retrieval task.
Key Novelty
Unsupervised Causal Graph Learning over Latent Concepts (CCG)
  • Extracts interpretable features from model activations using a task-conditioned Sparse Autoencoder with TopK gating to ensure strict sparsity
  • Learns a Directed Acyclic Graph (DAG) over these features using DAGMA (differentiable structure learning), effectively mapping how concepts causally activate one another
  • Validates the learned structure using a new Causal Fidelity Score (CFS), which measures whether intervening on learned 'parent' nodes induces larger downstream effects than intervening on randomly chosen nodes
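The two learning stages above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function names, the per-row TopK convention, and the choice of s are assumptions. TopK gating keeps only the k largest latent activations per example, and the DAGMA-style log-det penalty h(W) = -log det(sI - W∘W) + d·log s equals zero exactly when the weighted adjacency W is acyclic (for valid s).

```python
import numpy as np

def topk_gate(latents, k):
    """Keep the k largest activations in each row, zero the rest (strict sparsity).

    Note: ties at the threshold may keep slightly more than k entries; this is
    a simplification acceptable for a sketch.
    """
    z = np.asarray(latents, dtype=float).copy()
    if k >= z.shape[-1]:
        return z
    # threshold at the k-th largest value in each row
    thresh = np.partition(z, -k, axis=-1)[..., -k][..., None]
    z[z < thresh] = 0.0
    return z

def dagma_penalty(W, s=1.0):
    """DAGMA-style acyclicity penalty: zero iff the weighted graph W is a DAG."""
    d = W.shape[0]
    _, logdet = np.linalg.slogdet(s * np.eye(d) - W * W)
    return -logdet + d * np.log(s)
```

In the actual pipeline the penalty would be minimized jointly with a fit term so that gradient descent pushes W toward an acyclic structure; here it only serves to show the constraint's behavior.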
Evaluation Highlights
  • CCG achieves a Causal Fidelity Score (CFS) of 5.654 ± 0.625 across three reasoning benchmarks, significantly outperforming ROME-style tracing (3.382)
  • Outperforms SAE-only ranking (CFS 2.479) by ~128%, demonstrating that learned causal structure identifies influential concepts better than activation magnitude alone
  • Recovered graphs exhibit domain-specific topologies: LogiQA graphs are chain-like (sequential), while StrategyQA graphs are dense with hub nodes
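The intuition behind the Causal Fidelity Score can be illustrated with a toy concept graph (the exact CFS formula from the paper is not reproduced here; the node names, edge weights, and the ratio below are hypothetical): ablating a concept's learned parents should move the downstream concept far more than ablating unrelated nodes.

```python
def propagate(acts, edges, ablate=None):
    """Run activations through a toy weighted concept graph (edges in topological order)."""
    acts = dict(acts)
    if ablate in acts:
        acts[ablate] = 0.0  # intervention: clamp the ablated concept to zero
    for src, dst, w in edges:
        if dst == ablate:
            continue  # keep the ablated node clamped
        acts[dst] = acts.get(dst, 0.0) + w * acts[src]
    return acts

def fidelity_ratio(acts, edges, target, parents, others, eps=1e-9):
    """Mean downstream effect of parent ablations vs. ablations of unrelated nodes."""
    base = propagate(acts, edges)[target]
    effect = lambda node: abs(base - propagate(acts, edges, ablate=node)[target])
    parent_fx = sum(effect(p) for p in parents) / len(parents)
    other_fx = sum(effect(o) for o in others) / len(others)
    return parent_fx / max(other_fx, eps)

# hypothetical chain: 'paris' -> 'france' -> 'europe'; 'noise' is unrelated
acts = {"paris": 1.0, "france": 0.0, "europe": 0.0, "noise": 1.0}
edges = [("paris", "france", 0.9), ("france", "europe", 0.8)]
ratio = fidelity_ratio(acts, edges, "europe", ["paris", "france"], ["noise"])
```

A ratio well above 1 indicates the learned parent set is causally faithful, which mirrors the comparison against random-intervention baselines reported above.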
Breakthrough Assessment
8/10
Significant advance in unsupervised interpretability, moving from static feature isolation to dynamic causal structure learning. Rigorously validated via interventions, though currently limited to smaller models (GPT-2 Medium).