Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning

📝 Paper Summary

Chain-of-Thought (CoT) Optimization · Causal Reasoning in LLMs · Data Pruning / Dataset Construction

The authors propose a causal framework that quantifies the probability of necessity and sufficiency for individual reasoning steps, enabling the automated pruning of redundant steps and addition of missing ones to create efficient, high-quality Chain-of-Thought data.

Core Problem

Current Chain-of-Thought (CoT) reasoning suffers from two main issues: 'overthinking' (generating redundant, unnecessary steps) and logical gaps (missing steps required for the conclusion).

Why it matters:

Redundant steps increase computational cost and token usage without improving accuracy
Existing pruning methods rely on correlations (e.g., attention weights) rather than true causal impact, often keeping frequent but irrelevant steps
Logical gaps lead to 'hallucinated' correct answers that aren't actually supported by the reasoning chain

Concrete Example: In a GSM-8k problem, a model might calculate the answer correctly but include an irrelevant step such as 'I will now double-check the calculation', which does not causally contribute to the result. Conversely, it might jump to a conclusion without the necessary intermediate arithmetic, failing the sufficiency test.

Key Novelty

Probability of Necessity and Sufficiency (PNS) for CoT

Adapts Pearl's causal definitions to reasoning chains: a step is 'sufficient' if it guarantees the answer, and 'necessary' if removing it changes the correct answer to an incorrect one
Uses 'counterfactual rollouts': the system temporarily deletes or alters a reasoning step and lets the model continue generating; if the answer flips from correct to incorrect, the step was causally necessary
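
The counterfactual rollout test for a single step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_rollout` is a hypothetical stand-in for an LLM continuation, the step representation (a description plus a value update) is an assumption made so the example runs without a model, and `k=8` is an arbitrary choice.

```python
import random
from typing import Callable, List, Tuple

# A reasoning step is modeled as (description, update applied to a running value).
# This representation is an assumption made purely for illustration.
Step = Tuple[str, Callable[[float], float]]


def toy_rollout(steps: List[Step], start: float, rng: random.Random) -> float:
    """Hypothetical stand-in for an LLM rollout: apply the given steps in order.

    A real rollout would sample a continuation from the model; this toy version
    is deterministic, so `rng` is accepted but unused.
    """
    value = start
    for _, update in steps:
        value = update(value)
    return value


def estimate_necessity(steps: List[Step], index: int, start: float,
                       correct: float, k: int = 8, seed: int = 0) -> float:
    """Fraction of k rollouts whose answer becomes wrong once steps[index] is deleted."""
    rng = random.Random(seed)
    ablated = steps[:index] + steps[index + 1:]
    flips = sum(toy_rollout(ablated, start, rng) != correct for _ in range(k))
    return flips / k


chain: List[Step] = [
    ("3 apples at 2 dollars each", lambda v: v + 3 * 2),
    ("I will now double-check the calculation", lambda v: v),  # no causal effect
    ("add 1 dollar of tax", lambda v: v + 1),
]
answer = toy_rollout(chain, 0.0, random.Random(0))
scores = [estimate_necessity(chain, i, 0.0, answer) for i in range(len(chain))]
print(scores)  # [1.0, 0.0, 1.0]
```

The 'double-check' step scores 0.0 because deleting it never changes the answer, while the two arithmetic steps score 1.0. With a sampled model the scores would typically fall between these extremes.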

Architecture

The PNS-based evaluation and optimization pipeline.

Evaluation Highlights

Reduced average token count by ~45% (from 171.8 to 93.4) on GSM-8k while maintaining or improving accuracy
Improved accuracy by 3.4 percentage points on AIME 2024 using PNS-optimized CoT data for fine-tuning, compared to standard CoT
Achieved 67.2% accuracy on MATH-500 with Qwen2.5-7B-Instruct (SFT), outperforming the base model's 58.8%

Breakthrough Assessment

8/10

Strong theoretical grounding in causal inference applied to CoT. The method successfully reduces token costs while boosting accuracy, a rare 'win-win' in efficiency/performance trade-offs.

⚙️ Technical Details

Problem Definition

Setting: Chain-of-Thought reasoning where an input q generates a sequence of steps S = {s1, ..., sn} leading to answer a

Inputs: A question q and an initial (potentially flawed) reasoning chain S

Outputs: An optimized reasoning chain S_final that satisfies causal sufficiency and necessity

Pipeline Flow

Initial CoT Generation
Sufficiency Check (Algorithm 1)
Necessity Estimation (Algorithm 1)
Optimization/Pruning

System Modules

Base Generator

Generates the initial candidate Chain-of-Thought traces

Model or implementation: Various (e.g., Llama-3-8B-Instruct, Qwen2.5-7B-Instruct)

Rollout Model

Generates counterfactual continuations after a step is intervened upon (deleted or altered)

Model or implementation: Usually the same as the Base Generator

PNS Evaluator

Calculates PNS scores via Monte Carlo estimation over k rollouts

Model or implementation: Algorithm (Statistical Estimator)
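
A Monte Carlo estimate over k rollouts is the sample mean of binary outcomes, and its uncertainty shrinks roughly as 1/sqrt(k). The sketch below shows that calculation under stated assumptions: `mc_estimate` is a hypothetical helper name, the outcomes are made-up data, and k = 8 is arbitrary because the summary notes that the paper does not report k.

```python
import math
from typing import Sequence, Tuple


def mc_estimate(outcomes: Sequence[bool]) -> Tuple[float, float]:
    """Return (estimate, standard error) of a probability from k Bernoulli outcomes."""
    k = len(outcomes)
    if k == 0:
        raise ValueError("need at least one rollout")
    p_hat = sum(outcomes) / k
    # Standard error of a Bernoulli sample mean.
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / k)
    return p_hat, std_err


# Illustrative data: 6 of 8 counterfactual rollouts flipped the answer to incorrect.
flipped = [True, True, False, True, True, False, True, True]
p_hat, std_err = mc_estimate(flipped)
print(f"PN estimate = {p_hat:.2f} +/- {std_err:.2f}")  # PN estimate = 0.75 +/- 0.15
```

The wide error bar at k = 8 illustrates why dataset construction is expensive: tighter estimates need more rollouts per step.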

Novel Architectural Elements

Bi-level optimization framework: Outer loop maximizes chain sufficiency (PS), inner loop maximizes step necessity (PN)
Counterfactual rollout mechanism applied specifically to reasoning steps to determine causal necessity
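
One way to read the bi-level description is as a greedy loop: first repair the chain until it is sufficient, then prune steps with low estimated necessity. The sketch below is an assumption-laden illustration, not the paper's algorithm. `optimize_chain`, `is_sufficient`, `necessity`, `propose_step`, and the toy `REQUIRED` list are all hypothetical names, and `alpha=0.5` is arbitrary because the threshold is not reported.

```python
from typing import Callable, List, Optional


def optimize_chain(
    chain: List[str],
    is_sufficient: Callable[[List[str]], bool],
    necessity: Callable[[List[str], int], float],
    propose_step: Callable[[List[str]], Optional[str]],
    alpha: float = 0.5,
    max_additions: int = 10,
) -> List[str]:
    chain = list(chain)
    # Outer loop: add missing steps until the chain supports the answer.
    # Appending at the end is a simplification; a real system would insert at the gap.
    for _ in range(max_additions):
        if is_sufficient(chain):
            break
        new_step = propose_step(chain)
        if new_step is None:
            break
        chain.append(new_step)
    # Inner loop: drop steps whose estimated necessity is below alpha,
    # re-checking sufficiency so that pruning never breaks the chain.
    i = 0
    while i < len(chain):
        candidate = chain[:i] + chain[i + 1:]
        if necessity(chain, i) < alpha and is_sufficient(candidate):
            chain = candidate
        else:
            i += 1
    return chain


# Toy oracle (hypothetical): a chain is sufficient if it contains every required step.
REQUIRED = ["compute subtotal", "apply discount", "add tax"]


def is_sufficient(chain: List[str]) -> bool:
    return all(step in chain for step in REQUIRED)


def necessity(chain: List[str], i: int) -> float:
    return 0.0 if is_sufficient(chain[:i] + chain[i + 1:]) else 1.0


def propose_step(chain: List[str]) -> Optional[str]:
    missing = [step for step in REQUIRED if step not in chain]
    return missing[0] if missing else None


start = ["compute subtotal", "restate the question", "add tax"]
result = optimize_chain(start, is_sufficient, necessity, propose_step)
print(result)  # ['compute subtotal', 'add tax', 'apply discount']
```

In practice the three callables would be backed by model rollouts and answer checking rather than a fixed list of required steps.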

Modeling

Base Model: Llama-3-8B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-Math-7B-Instruct

Training Method: Supervised Fine-Tuning (SFT) on PNS-optimized CoT data

Objective Functions:

Purpose: Minimize negative log-likelihood of the optimized reasoning steps.

Formally: Standard language modeling loss.
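
For reference, the standard language modeling loss can be written out as below. The notation is an assumption, not taken from the paper: \mathcal{D} denotes the set of PNS-optimized training examples, y is the token sequence of the optimized chain S_final followed by the answer a, and p_\theta is the model being fine-tuned.

```latex
\mathcal{L}(\theta) = - \sum_{(q,\, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid q,\, y_{<t}\right)
```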

Training Data:

1,229 high-quality CoT traces selected from training sets of GSM-8k and MATH

Key Hyperparameters:

learning_rate: 5e-6
batch_size: Not explicitly reported in the paper
epochs: 2
alpha_threshold: Not explicitly reported in the paper
rollout_k: Not explicitly reported in the paper

Compute: Inference performed using vLLM; training compute not explicitly reported

Comparison to Prior Work

vs. Self-Consistency: PNS actively prunes steps to reduce cost, whereas Self-Consistency increases cost by generating more paths.
vs. Step-Aware Pruning: Prior pruning uses attention/perplexity (correlation); PNS uses counterfactual rollouts (causation).
vs. Reflexion: PNS uses mathematical causal definitions rather than verbal self-critique.

Limitations

Computational cost of constructing the dataset is high due to multiple counterfactual rollouts per step.
The method relies on the base model's ability to generate valid counterfactuals; if the model is too weak, the causal estimation fails.
Approximates sufficiency as binary (0 or 1) based on final answer correctness, which may overlook partial reasoning errors.
Requires an 'answer' to define sufficiency, making it harder to apply to open-ended tasks without ground truth.

Reproducibility

Code: https://github.com/yxn9191/causalmath

Inference uses vLLM. The rollout threshold (alpha) and the number of Monte Carlo samples (k) are not specified in the main text.

📊 Experiments & Results

Evaluation Setup

Mathematical and Commonsense Reasoning Tasks

Benchmarks:

GSM-8k (Grade school math problems)
MATH-500 (Challenging math problems; a subset of MATH)
AIME (High-difficulty competition math)
CommonsenseQA (Commonsense reasoning)

Metrics:

Accuracy (Answer Correctness)
Average Token Length (Efficiency)
Probability of Sufficiency (PS)
Statistical methodology: Not explicitly reported in the paper

Key Results

Efficiency results showing a large reduction in token usage on GSM-8k while maintaining accuracy:

Benchmark	Metric	Baseline	This Paper	Δ
GSM-8k	Average Token Length	171.8	93.4	-78.4

Supervised Fine-Tuning (SFT) results showing that training on PNS-optimized data improves reasoning performance over the base model:

Benchmark	Metric	Baseline	This Paper	Δ
MATH-500	Accuracy (%)	58.8	67.2	+8.4
GSM-8k	Accuracy (%)	78.2	83.6	+5.4
AIME 2024	Accuracy (%)	13.3	16.7	+3.4
GSM-8k	Accuracy (%)	50.6	52.8	+2.2
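
The Δ column and the headline token saving can be rechecked from the reported baseline and result values. The snippet below is a small arithmetic check. The numbers are copied from the rows above, and the variable names are illustrative.

```python
# (benchmark, metric, baseline, this paper), copied from the Key Results rows.
rows = [
    ("GSM-8k", "Average Token Length", 171.8, 93.4),
    ("MATH-500", "Accuracy", 58.8, 67.2),
    ("GSM-8k", "Accuracy", 78.2, 83.6),
    ("AIME 2024", "Accuracy", 13.3, 16.7),
    ("GSM-8k", "Accuracy", 50.6, 52.8),
]

# Absolute differences; for accuracy rows these are percentage points.
deltas = [round(new - base, 1) for _, _, base, new in rows]
print(deltas)  # [-78.4, 8.4, 5.4, 3.4, 2.2]

# Relative token reduction on GSM-8k, in percent.
token_reduction_pct = round(100 * (171.8 - 93.4) / 171.8, 1)
print(token_reduction_pct)  # 45.6
```

The relative reduction of about 45.6% is consistent with the roughly 45% saving quoted in the highlights, and the accuracy gains are absolute differences in percentage points rather than relative improvements.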

Experiment Figures

A qualitative comparison of a reasoning trace before and after PNS optimization.

Main Takeaways

PNS-based pruning successfully identifies and removes redundant reasoning steps without hurting accuracy, leading to ~45% token savings.
Models fine-tuned on causally optimized CoT data (SFT) generalize well, showing gains across difficulty levels (from GSM-8k to AIME).
The method works for both In-Context Learning (using optimized shots) and Fine-Tuning, suggesting the model learns the structure of necessary reasoning.
Unlike correlation-based pruning, causal pruning ensures that removed steps were truly unnecessary for the final answer.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Causal Inference (Interventions, Counterfactuals)
Monte Carlo estimation

Key Terms

PNS: Probability of Necessity and Sufficiency—a causal metric measuring how likely a specific step is both required for the outcome and capable of producing it

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

Rollout: A process where the model generates the remainder of a sequence from a specific intervention point (e.g., after deleting a step) to see how the outcome changes

Intervention: Deliberately changing a variable (in this case, a reasoning step) to observe causal effects, denoted as do(S)

ICL: In-Context Learning—providing examples in the prompt to guide model behavior without updating weights

SFT: Supervised Fine-Tuning—updating model weights on a labeled dataset

GSM-8k: A benchmark dataset of grade school math word problems

AIME: American Invitational Mathematics Examination—a benchmark of difficult math competition problems

PN: Probability of Necessity—probability that the correct answer would not have occurred had the step been removed/changed

PS: Probability of Sufficiency—probability that the step guarantees the correct answer
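
For readers who want the formal versions of PN, PS, and PNS, Pearl's standard definitions for a binary cause X and binary outcome Y are given below. Mapping them onto reasoning chains is an assumption made here for illustration: X = x means the step is present, X = x' means it is removed or altered, Y = y means the final answer is correct, and Y = y' means it is incorrect. The paper's adapted definitions may condition on different events.

```latex
\begin{aligned}
\mathrm{PN}  &= P\left(Y_{x'} = y' \mid X = x,\; Y = y\right) \\
\mathrm{PS}  &= P\left(Y_{x} = y \mid X = x',\; Y = y'\right) \\
\mathrm{PNS} &= P\left(Y_{x} = y,\; Y_{x'} = y'\right)
\end{aligned}
```

Here Y_{x} denotes the value Y would take under the intervention do(X = x), which corresponds to the outcome of a rollout after the step is forced to be present or absent.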