Evaluation Setup
Zero-shot and few-shot prompting on standard causal benchmarks covering pairwise discovery, full graph discovery, and counterfactual reasoning.
Benchmarks:
- Tübingen Benchmark: pairwise causal discovery (A -> B or B -> A)
- Neuropathic Pain Diagnosis: full causal graph discovery
- CRASS (Counterfactual Reasoning Assessment): counterfactual reasoning, multiple choice
- Vignettes (Big Bench & novel): token causality (necessary/sufficient causes) [New]
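As one illustration of the pairwise setup, a zero-shot query can be framed as a forced choice between the two causal directions. This is a hypothetical sketch (the function name, wording, and answer format are illustrative assumptions, not the paper's exact prompt):

```python
def pairwise_prompt(var_a: str, var_b: str) -> str:
    """Build a zero-shot prompt asking which causal direction is more likely.

    Illustrative only: the paper's actual prompt wording may differ.
    """
    return (
        f"Which cause-and-effect relationship is more likely?\n"
        f"A. {var_a} causes {var_b}.\n"
        f"B. {var_b} causes {var_a}.\n"
        f"Answer with A or B."
    )

# Example pair in the style of the Tübingen benchmark:
print(pairwise_prompt("altitude", "temperature"))
```

The model's single-letter answer is then compared against the benchmark's ground-truth direction to compute accuracy.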
Metrics:
- Accuracy
- F1 Score
- SHD (Structural Hamming Distance)
- Statistical significance methodology: not explicitly reported in the paper
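The graph-level metrics can be sketched on toy directed graphs represented as sets of (parent, child) edges. This is a minimal illustration, assuming the common SHD convention in which a reversed edge counts as a single error (the paper may use a different variant):

```python
def edge_f1(pred: set, true: set) -> float:
    """F1 score over predicted vs. true directed edges."""
    tp = len(pred & true)  # correctly oriented edges
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

def shd(pred: set, true: set) -> int:
    """Structural Hamming Distance: insertions, deletions, and reversals.

    Assumes a reversed edge counts as one error, a common convention.
    """
    errors = 0
    counted_pairs = set()
    for a, b in pred ^ true:  # edges present in exactly one graph
        pair = frozenset((a, b))
        if pair in counted_pairs:
            continue  # already counted as part of a reversal
        if (b, a) in pred ^ true:
            counted_pairs.add(pair)  # opposite orientation also differs: one reversal
        errors += 1
    return errors
```

For example, with `true = {("X", "Y"), ("Y", "Z")}` and `pred = {("Y", "X"), ("Y", "Z"), ("W", "Z")}`, the edge F1 is 0.4 and the SHD is 2 (one reversal plus one spurious edge).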
Key Results
Results on pairwise causal discovery tasks showing LLM superiority over statistical methods:

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Tübingen Benchmark | Accuracy (%) | 83 | 97 | +14 |

Results on token causality and counterfactual reasoning tasks:

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| CRASS | Accuracy (%) | 72 | 92 | +20 |
| Vignettes | Accuracy (%) | Not reported | 86 | Not reported |

Results on full graph discovery improving over prior LLM baselines:

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Neuropathic Pain Diagnosis | F1 Score (Edges) | 0.21 | 0.68 | +0.47 |
Main Takeaways
- LLMs effectively capture domain knowledge required for causal discovery, often outperforming data-driven algorithms that struggle with directionality.
- High performance generalizes to novel datasets created after the LLM training cutoff, suggesting capabilities go beyond simple memorization.
- LLMs are particularly strong at identifying necessary and sufficient causes in natural language scenarios (Token Causality).
- While accurate on these benchmarks, LLMs should be used to augment human experts or bootstrap causal analysis rather than be trusted blindly, given their potential to hallucinate.