
Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Md Muntaqim Meherab, Noor Islam S. Mohammad, Faiza Feroz
arXiv (2026)
Reasoning Benchmark

📝 Paper Summary

Mechanistic Interpretability · Sparse Autoencoders · Causal Abstraction
Causal Concept Graphs combine task-conditioned sparse autoencoders with differentiable structure learning to uncover and validate the causal dependencies between latent concepts in language models without manual annotation.
Core Problem
Mechanistic interpretability tools like Sparse Autoencoders localize features but fail to capture the dynamic, multi-step causal interactions between them as reasoning unfolds.
Why it matters:
  • Without tracing internal reasoning steps, we cannot robustly diagnose failures or distinguish genuine reasoning from shortcut strategies in safety-critical systems
  • Existing model editing methods (e.g., ROME) target single factual associations but are not designed for distributed, compositional reasoning chains
Concrete Example: Current methods might localize the concept of 'France' or 'Paris', but cannot automatically recover the causal chain where activating 'Paris' causally precedes and triggers the activation of 'France' during a retrieval task.
Key Novelty
Unsupervised Causal Graph Learning over Latent Concepts (CCG)
  • Extracts interpretable features from model activations using a task-conditioned Sparse Autoencoder with TopK gating to ensure strict sparsity
  • Learns a Directed Acyclic Graph (DAG) over these features using DAGMA (differentiable structure learning), effectively mapping how concepts causally activate one another
  • Validates the learned structure using a new Causal Fidelity Score (CFS), which measures whether intervening on learned 'parent' nodes induces larger downstream effects than intervening on randomly chosen nodes
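The two learning stages above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function names, the per-row TopK convention, and the choice of s are assumptions. TopK gating keeps only the k largest latent activations per example, and the DAGMA-style log-det penalty h(W) = -log det(sI - W∘W) + d·log s equals zero exactly when the weighted adjacency W is acyclic (for valid s).

```python
import numpy as np

def topk_gate(latents, k):
    """Keep the k largest activations in each row, zero the rest (strict sparsity).

    Note: ties at the threshold may keep slightly more than k entries; this is
    a simplification acceptable for a sketch.
    """
    z = np.asarray(latents, dtype=float).copy()
    if k >= z.shape[-1]:
        return z
    # threshold at the k-th largest value in each row
    thresh = np.partition(z, -k, axis=-1)[..., -k][..., None]
    z[z < thresh] = 0.0
    return z

def dagma_penalty(W, s=1.0):
    """DAGMA-style acyclicity penalty: zero iff the weighted graph W is a DAG."""
    d = W.shape[0]
    _, logdet = np.linalg.slogdet(s * np.eye(d) - W * W)
    return -logdet + d * np.log(s)
```

In the actual pipeline the penalty would be minimized jointly with a fit term so that gradient descent pushes W toward an acyclic structure; here it only serves to show the constraint's behavior.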
Evaluation Highlights
  • CCG achieves a Causal Fidelity Score (CFS) of 5.654 ± 0.625 across three reasoning benchmarks, significantly outperforming ROME-style tracing (3.382)
  • Outperforms SAE-only ranking (CFS 2.479) by ~128%, demonstrating that learned causal structure identifies influential concepts better than activation magnitude alone
  • Recovered graphs exhibit domain-specific topologies: LogiQA graphs are chain-like (sequential), while StrategyQA graphs are dense with hub nodes
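The intuition behind the Causal Fidelity Score can be illustrated with a toy concept graph (the exact CFS formula from the paper is not reproduced here; the node names, edge weights, and the ratio below are hypothetical): ablating a concept's learned parents should move the downstream concept far more than ablating unrelated nodes.

```python
def propagate(acts, edges, ablate=None):
    """Run activations through a toy weighted concept graph (edges in topological order)."""
    acts = dict(acts)
    if ablate in acts:
        acts[ablate] = 0.0  # intervention: clamp the ablated concept to zero
    for src, dst, w in edges:
        if dst == ablate:
            continue  # keep the ablated node clamped
        acts[dst] = acts.get(dst, 0.0) + w * acts[src]
    return acts

def fidelity_ratio(acts, edges, target, parents, others, eps=1e-9):
    """Mean downstream effect of parent ablations vs. ablations of unrelated nodes."""
    base = propagate(acts, edges)[target]
    effect = lambda node: abs(base - propagate(acts, edges, ablate=node)[target])
    parent_fx = sum(effect(p) for p in parents) / len(parents)
    other_fx = sum(effect(o) for o in others) / len(others)
    return parent_fx / max(other_fx, eps)

# hypothetical chain: 'paris' -> 'france' -> 'europe'; 'noise' is unrelated
acts = {"paris": 1.0, "france": 0.0, "europe": 0.0, "noise": 1.0}
edges = [("paris", "france", 0.9), ("france", "europe", 0.8)]
ratio = fidelity_ratio(acts, edges, "europe", ["paris", "france"], ["noise"])
```

A ratio well above 1 indicates the learned parent set is causally faithful, which mirrors the comparison against random-intervention baselines reported above.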
Breakthrough Assessment
8/10
Significant advance in unsupervised interpretability, moving from static feature isolation to dynamic causal structure learning. Rigorously validated via interventions, though currently limited to smaller models (GPT-2 Medium).