Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training

📝 Paper Summary

LLM Reasoning, Spurious Correlations, Causal Inference

CAPT mitigates reasoning failures by decomposing prediction into event estimation and symbolic intervention, stripping away spurious correlations while preserving logical structure.

Core Problem

LLMs rely on spurious correlations (shortcuts) learned during pre-training, causing catastrophic failure on out-of-distribution (OOD) reasoning tasks where surface features change but logic remains constant.

Why it matters:

Models fail on domain-specific tasks (e.g., formal causal inference) when entity names are perturbed, even if the underlying logic is identical
Standard fine-tuning on small datasets often reinforces pre-existing biases or introduces new selection biases rather than teaching true reasoning structure
Existing bias mitigation focuses on entity bias (e.g., names) but underexplores event-level bias (e.g., 'alarm is set') crucial for complex reasoning

Concrete Example: In PrOntoQA, GPT-4o-mini achieves 83.5% on commonsense queries but drops to 61.25% on 'anti-sense' queries where rules contradict prior knowledge (e.g., fictitious rules about alarms), revealing reliance on pre-trained semantic shortcuts.

Key Novelty

Causality-Aware Post-Training (CAPT)

Decomposes the prediction process: leverages the LLM's strong general knowledge to estimate events, then intervenes to replace them with abstract placeholders
Applies 'Event Intervention' by mapping specific events (e.g., 'Husband sets alarm') to neutral symbols ({symbol_1}) and finally to random letters, blocking semantic shortcuts
Enforces learning of the invariant logical structure $S$ rather than surface correlations between events $E$ and answers $Y$

Architecture

The CAPT implementation pipeline comprising Data Transformation (Training) and Inference adaptation

Evaluation Highlights

+11.75 percentage-point accuracy improvement on PrOntoQA OOD (Anti-sense) for Qwen2.5-3B using CAPT with CoT (100 samples), compared to standard CoT fine-tuning
+9.13 percentage-point accuracy improvement on CLadder OOD (Anti-sense) for Qwen2.5-3B using CAPT with CoT (100 samples), compared to standard CoT fine-tuning
Standard deviation of accuracy across distributions (ID vs. OOD) drops from ~14.8 to 3.4 on PrOntoQA

Breakthrough Assessment

7/10

Simple, theoretically grounded approach that effectively decouples semantic bias from logical reasoning. Strong sample efficiency, though it relies on a separate, stronger model for the transformation step.

⚙️ Technical Details

Problem Definition

Setting: Robust reasoning under distribution shift, modeled via a Structural Causal Model where Input $X$ is a collider between Events $E$ and Structure $S$

Inputs: Natural language reasoning prompt $X$ (e.g., causal query or logical deduction problem)

Outputs: Reasoning trace and final binary answer $Y$ (Yes/No)

Pipeline Flow

Data Transformation: Input Prompt -> Event Estimation -> Event Intervention -> Randomized Assignment
Reasoning: Transformed Prompt -> Fine-tuned Model -> CoT -> Answer

System Modules

Event Transformer (Input Processing)

Identify events in the prompt and replace them with abstract placeholders

Model or implementation: GPT-4o-mini (frozen, used for data processing)
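The intervention half of this module can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the event list is assumed to come from the helper model (here it is hard-coded), and the function name `intervene_on_events` and the example prompt are hypothetical.

```python
import re

def intervene_on_events(text: str, events: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each listed event with a neutral placeholder such as {symbol_1}."""
    # Number placeholders in the order the events were listed.
    mapping = {event: f"{{symbol_{i}}}" for i, event in enumerate(events, start=1)}
    # Replace longer events first so an event nested inside another is not split.
    for event in sorted(events, key=len, reverse=True):
        text = re.sub(re.escape(event), mapping[event], text, flags=re.IGNORECASE)
    return text, mapping

# Hypothetical prompt; the event list stands in for the helper model's output.
prompt = (
    "If the husband sets the alarm, then the alarm rings. "
    "The husband sets the alarm. Is it true that the alarm rings?"
)
events = ["the husband sets the alarm", "the alarm rings"]
anonymized, event_map = intervene_on_events(prompt, events)
print(anonymized)
```

Exact string matching is the simplest possible choice here; real prompts paraphrase events, which is presumably why the paper delegates this step to an LLM.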

Randomizer (Input Processing)

Map abstract placeholders to random capital letters to ensure permutation invariance

Model or implementation: Rule-based script
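A rule-based randomizer of this kind might look like the sketch below. It assumes placeholders follow the `{symbol_N}` pattern and that a prompt and its CoT trace must share one mapping; the function name and sample strings are hypothetical.

```python
import random
import re
import string

PLACEHOLDER = re.compile(r"\{symbol_\d+\}")

def randomize_placeholders(texts: list[str], rng: random.Random) -> tuple[list[str], dict[str, str]]:
    """Map each distinct placeholder to a distinct random capital letter, shared across all texts."""
    placeholders = sorted({m for t in texts for m in PLACEHOLDER.findall(t)})
    # sample() draws without replacement, so no two placeholders share a letter.
    # It raises ValueError if there are more than 26 placeholders.
    letters = rng.sample(string.ascii_uppercase, k=len(placeholders))
    mapping = dict(zip(placeholders, letters))
    rewritten = [PLACEHOLDER.sub(lambda m: mapping[m.group(0)], t) for t in texts]
    return rewritten, mapping

# Hypothetical anonymized prompt and CoT trace.
prompt = "If {symbol_1}, then {symbol_2}. {symbol_1}. Is it true that {symbol_2}?"
cot = "{symbol_1} holds, and {symbol_1} implies {symbol_2}, so {symbol_2} holds."
(new_prompt, new_cot), letter_map = randomize_placeholders([prompt, cot], random.Random(0))
print(new_prompt)
print(new_cot)
```

Drawing a fresh mapping per training example means no particular letter is tied to a particular role, which is one way to encourage the permutation invariance described above.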

Reasoning Model

Generate the reasoning trace and final answer based on the anonymized structure

Model or implementation: Qwen2.5-3B (fine-tuned)

Novel Architectural Elements

Inference-time 'Event Intervention' pipeline that dynamically anonymizes OOD inputs using a stronger helper model to match the ID fine-tuning distribution
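The inference-time flow (Estimate -> Intervene -> Reason) can be sketched as a composition of two model calls around a string rewrite. Both callables below are hypothetical stubs standing in for the helper model and the fine-tuned model; the placeholder and letter steps are collapsed into one for brevity.

```python
import random
import re
import string
from typing import Callable

def capt_infer(
    prompt: str,
    estimate_events: Callable[[str], list[str]],  # hypothetical helper-model hook
    reason: Callable[[str], str],                 # hypothetical fine-tuned-model hook
    rng: random.Random,
) -> str:
    events = estimate_events(prompt)                             # Estimate
    letters = rng.sample(string.ascii_uppercase, k=len(events))  # one distinct letter per event
    pairs = sorted(zip(events, letters), key=lambda p: len(p[0]), reverse=True)
    for event, letter in pairs:                                  # Intervene, longest event first
        prompt = re.sub(re.escape(event), letter, prompt, flags=re.IGNORECASE)
    return reason(prompt)                                        # Reason

seen: list[str] = []

def stub_reason(anonymized: str) -> str:
    """Stub that records what the reasoning model would receive."""
    seen.append(anonymized)
    return "Yes"

answer = capt_infer(
    "If the husband sets the alarm, then the alarm rings. "
    "The husband sets the alarm. Is it true that the alarm rings?",
    estimate_events=lambda p: ["the husband sets the alarm", "the alarm rings"],
    reason=stub_reason,
    rng=random.Random(0),
)
print(seen[0], "->", answer)
```

Because the final answer is a bare Yes/No, nothing has to be mapped back to the original events after reasoning.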

Modeling

Base Model: Qwen2.5-3B

Training Method: Supervised Fine-Tuning (SFT) on CAPT-transformed data

Training Data:

Data Triplets: (Original Prompt, CoT Trace, Answer)
Transformed via GPT-4o-mini into (Anonymized Prompt, Anonymized CoT, Answer)
Sample sizes: 100 and 200 samples tested

Compute: Not reported in the paper

Comparison to Prior Work

vs. CausalCOT: CAPT modifies the *input* distribution to remove bias before reasoning, whereas CausalCOT focuses only on the reasoning format
vs. Logic-LM: CAPT keeps the reasoning inside the LLM (via CoT) rather than offloading it to a deterministic solver
vs. Standard SFT: CAPT actively removes semantic information to prevent overfitting to spurious correlations, enabling 100-shot generalization

Limitations

Relies on a capable external model (GPT-4o-mini) or oracle for accurate event estimation during inference
Assumes that 'Event Estimation' is a perfectly transferable capability, which may not hold in extremely obscure domains
The method increases inference latency due to the multi-step transformation process (Estimate -> Intervene -> Reason)

Reproducibility

No code provided. Implementation relies on GPT-4o-mini for data transformation (prompt templates not explicitly provided in full, though described conceptually). Qwen2.5-3B is open weights. Specific fine-tuning hyperparameters (LR, batch size) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Binary classification (Yes/No) on causal and logical reasoning tasks

Benchmarks:

CLadder (Formal Causal Inference: Associational, Interventional, Counterfactual)
PrOntoQA (Logical Deductive Reasoning)

Metrics:

Accuracy
Standard Deviation (STD) across ID/OOD sets
Statistical methodology: Not explicitly reported in the paper
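The two metrics can be computed as in the sketch below. The predictions are toy values, not results from the paper, and the use of the population standard deviation (`pstdev`) is an assumption, since the summary does not say which estimator the authors used.

```python
from statistics import pstdev

def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Percentage of Yes/No predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

# Toy (predictions, labels) pairs per distribution; NOT data from the paper.
splits = {
    "commonsense": (["Yes", "No", "Yes", "No"], ["Yes", "No", "Yes", "No"]),
    "anti-sense": (["Yes", "Yes", "Yes", "No"], ["Yes", "No", "No", "No"]),
    "non-sense": (["No", "No", "Yes", "No"], ["Yes", "No", "Yes", "No"]),
}

per_split = {name: accuracy(preds, labels) for name, (preds, labels) in splits.items()}
# Spread of accuracy across distributions; lower means more stable reasoning.
spread = pstdev(list(per_split.values()))
print(per_split, round(spread, 2))
```

A model that scores well only on the commonsense split shows a large spread, which is the pattern the STD metric is meant to expose.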

Key Results

Results with 100 training samples on PrOntoQA (Logical Reasoning) and CLadder (Causal Inference). 'Anti-sense' is the key OOD metric where labels contradict commonsense.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
PrOntoQA	Accuracy (Anti-sense)	70.75	82.50	+11.75
PrOntoQA	Accuracy (Commonsense)	99.50	87.50	-12.00
CLadder	Accuracy (Anti-sense)	68.85	77.98	+9.13
CLadder	Accuracy (Commonsense)	67.98	75.96	+7.98

Main Takeaways

CAPT significantly reduces the variance (Standard Deviation) of performance across different distributions (Commonsense, Anti-sense, Non-sense), indicating more stable reasoning
Sample efficiency is high: 3B models fine-tuned with CAPT on just 100 samples can outperform larger models (GPT-4o) on OOD tasks
The method is particularly effective on 'Anti-sense' datasets where pre-training bias actively hurts performance
Combining CAPT with Chain-of-Thought (CoT) yields better results than Answer-only training, confirming that explicit reasoning traces help approximate the underlying causal structure

📚 Prerequisite Knowledge

Prerequisites

Structural Causal Models (SCM)
Causal graph concepts (colliders, confounders)
Large Language Model fine-tuning (SFT)

Key Terms

Spurious Correlation: A statistical relationship between variables that is not causal, often caused by a common confounder or selection bias

Structural Causal Model (SCM): A framework describing the causal mechanisms of a system using variables and structural equations

Collider: A variable in a causal graph that is influenced by two or more other variables (e.g., $E \rightarrow X \leftarrow S$)
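Why a collider matters can be checked by exact enumeration. The sketch below uses a toy world that is an assumption for illustration, not the paper's SCM: $E$ and $S$ are independent fair bits and $X = E \lor S$. They are independent unconditionally but become dependent once $X$ is conditioned on.

```python
from fractions import Fraction
from itertools import product

# Toy world: E and S are independent fair coin flips; X = E OR S is their collider.
worlds = [(e, s, e | s) for e, s in product((0, 1), repeat=2)]

def prob(event, given=lambda w: True) -> Fraction:
    """Exact conditional probability over the four equally likely worlds."""
    kept = [w for w in worlds if given(w)]
    return Fraction(sum(1 for w in kept if event(w)), len(kept))

e_is_1 = lambda w: w[0] == 1
s_is_1 = lambda w: w[1] == 1
both = lambda w: w[0] == 1 and w[1] == 1
x_is_1 = lambda w: w[2] == 1

# Unconditionally, the joint factorizes: E and S are independent.
joint = prob(both)
marginal_product = prob(e_is_1) * prob(s_is_1)

# Conditioned on the collider X = 1, the joint no longer factorizes.
joint_given_x = prob(both, x_is_1)
product_given_x = prob(e_is_1, x_is_1) * prob(s_is_1, x_is_1)

print(joint, marginal_product, joint_given_x, product_given_x)
```

The induced dependence is a purely statistical artifact of selecting on $X$, which is the kind of non-causal association the term "spurious correlation" refers to.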

Confounder: A variable that influences both the dependent and independent variables, causing a spurious association

Backdoor Adjustment: A causal inference technique to estimate causal effects by blocking non-causal paths (confounding)

Event Estimation: The process of identifying and extracting distinct events (e.g., 'Husband sets alarm') from a natural language prompt

Event Intervention: Replacing identified events with abstract symbols to break semantic associations

SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, task-specific dataset

CoT: Chain-of-Thought—prompting the model to generate intermediate reasoning steps before the final answer