Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning

📝 Paper Summary

Faithfulness in Large Language Models, Chain-of-Thought Reasoning, Causal Analysis of LLMs

The paper uses causal mediation analysis to show LLMs often ignore their own reasoning steps and proposes FRODO, a framework that trains separate inference and reasoning modules with counterfactual and preference objectives to force the answer to faithfully follow the reasoning.

Core Problem

Large language models often generate 'unfaithful' Chain-of-Thought explanations: the final answer does not causally depend on the generated reasoning steps, and the reasoning instead serves as a post-hoc justification.

Why it matters:

If reasoning is merely post-hoc, users cannot trust the model's explanations to diagnose errors or verify safety
Models that ignore their own reasoning steps are less robust to perturbations and generalize poorly to out-of-distribution tasks
Prior methods focus on task performance (accuracy) rather than the causal validity of the reasoning process

Concrete Example: In a causal intervention study, GPT-4 is given a perturbed, counterfactual reasoning chain that logically contradicts its original answer. It changes its answer only 30% of the time; in the remaining cases it ignores the provided logic and sticks to its prior answer.

Key Novelty

FRODO (Framework for Reasoning and Optimization with DPO and Objectives)

Decomposes reasoning into two distinct modules: an Inference Module that generates reasoning steps and a Reasoning Module that predicts the answer based on those steps
Uses Causal Mediation Analysis as a training signal, optimizing the Reasoning Module to maximize the 'Indirect Effect' (how much the reasoning actually changes the answer)
Trains the Inference Module using Direct Preference Optimization (DPO) to prefer correct reasoning chains over irrelevant or counterfactual ones without explicit human labeling

Architecture

A Causal Mediation Analysis graph visualizing the relationship between Input (X), Reasoning Chain (R), and Final Answer (Y).
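
As a hedged sketch, the two effects on this graph can be written in the standard natural direct/indirect effect form from mediation analysis. The notation is an assumption for illustration and may differ from the paper's: x_0 is the original input, x_1 an intervened input, R(x) the reasoning chain generated from x, and Y(x, r) the answer given input x and chain r.

```latex
\begin{align}
\mathrm{DE} &= \mathbb{E}\big[\, Y(x_1, R(x_0)) - Y(x_0, R(x_0)) \,\big] \\
\mathrm{IE} &= \mathbb{E}\big[\, Y(x_0, R(x_1)) - Y(x_0, R(x_0)) \,\big]
\end{align}
```

Under this reading, a faithful model shows a large indirect effect: swapping in a different reasoning chain changes the answer even when the question is held fixed.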

Evaluation Highlights

+2% to +3% absolute accuracy improvement over standard supervised fine-tuning and CoT distillation methods across four reasoning tasks
+4.5% improvement in robustness (faithfulness), measured by how reliably the model alters its answer when conditioned on counterfactual reasoning chains
+2.6% performance improvement on out-of-distribution test sets compared to supervised fine-tuning, indicating better generalization

Breakthrough Assessment

7/10

Novel application of Causal Mediation Analysis not just for evaluation but as a training objective. Addresses the critical 'hallucinated reasoning' problem directly, though improvements are incremental.

⚙️ Technical Details

Problem Definition

Setting: Reasoning task mapping input x to output y via intermediate reasoning steps R

Inputs: Reasoning problem x (e.g., question)

Outputs: Final answer y

Pipeline Flow

Inference Module (Generates candidate reasoning chains)
Reasoning Module (Predicts final answer using reasoning)
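
The two-stage flow above can be sketched as plain function composition. This is a minimal illustration, not the authors' implementation: the type aliases, `run_pipeline`, and the toy modules are hypothetical stand-ins for the two fine-tuned LMs.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases: each module is modeled as a plain callable.
InferenceModule = Callable[[str], List[str]]        # question -> reasoning steps
ReasoningModule = Callable[[str, List[str]], str]   # (question, steps) -> answer


def run_pipeline(question: str,
                 infer: InferenceModule,
                 reason: ReasoningModule) -> Tuple[List[str], str]:
    """Generate a reasoning chain, then predict the answer conditioned on it."""
    steps = infer(question)
    answer = reason(question, steps)
    return steps, answer


# Toy stand-ins for the two modules (illustration only, not real LMs).
def toy_infer(question: str) -> List[str]:
    return ["Ice has less friction than carpet.",
            "Less friction means an object slides farther."]


def toy_reason(question: str, steps: List[str]) -> str:
    # The answer depends on the reasoning chain, not only on the question.
    uses_friction = any("less friction" in s.lower() for s in steps)
    return "ice" if uses_friction else "carpet"


steps, answer = run_pipeline(
    "Will a puck slide farther on ice or carpet?", toy_infer, toy_reason
)
print(answer)
```

Keeping the modules as separate callables makes it easy to feed the Reasoning Module a different chain than the one the Inference Module produced, which is what the counterfactual interventions require.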

System Modules

Inference Module

Generate correct reasoning steps (inference chains) necessary to reach a conclusion

Model or implementation: Small-sized LM (<10B, e.g., Mistral-7B, Llama-2-7B)

Reasoning Module

Robustly use the generated reasoning steps to reach the final conclusion

Model or implementation: Small-sized LM (<10B)

Novel Architectural Elements

Explicit separation of rationalizer (Inference) and predictor (Reasoning) modules optimized for causal dependence
Integration of Causal Mediation Analysis metrics (Indirect Effect) directly into the loss function of the Reasoning Module

Modeling

Base Model: The causal mediation analysis covers twelve LLMs, including Llama-2-7B-Chat, Mistral-Instruct-7B, GPT-3.5-Instruct, and GPT-4

Training Method: FRODO (DPO + Causal Objective Fine-tuning)

Objective Functions:

Purpose: Train Inference Module to generate correct reasoning.

Formally: DPO loss L_DPO = -E[log sigma(beta * log(pi(r_w|x)/pi_ref(r_w|x)) - beta * log(pi(r_l|x)/pi_ref(r_l|x)))]

Purpose: Train Reasoning Module to answer correctly (Standard LM).

Formally: L_LM = CrossEntropy(y_gold, y_pred)

Purpose: Force Reasoning Module to change answer if reasoning changes (Counterfactual).

Formally: L_counter = Indirect Effect maximization (using counterfactual chains)

Purpose: Rank correct reasoning/answer pairs higher than counterfactuals.

Formally: L_PREF = max(0, m - (h(x, r_w, y_w) - h(x, r_l, y_w)))
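
The DPO and preference-ranking objectives listed above can be sketched per example in plain Python. This is a hedged illustration that assumes the sequence log-probabilities and the scores h(.) are already computed. The function names, argument names, and default values for `beta` and `m` are assumptions, not values taken from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss.

    logp_w / logp_l: policy log-probabilities of the preferred (r_w) and
    dispreferred (r_l) reasoning chains given x.
    ref_logp_w / ref_logp_l: the same quantities under the reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def preference_margin_loss(score_w: float, score_l: float, m: float = 1.0) -> float:
    """Hinge loss max(0, m - (h(x, r_w, y_w) - h(x, r_l, y_w)))."""
    return max(0.0, m - (score_w - score_l))


# When policy and reference agree, the DPO loss equals log(2).
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))
# The hinge loss is zero once the correct chain wins by at least the margin.
print(preference_margin_loss(2.0, 0.5, m=1.0))
```

The counterfactual objective L_counter is left out because the summary does not give its closed form.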

Training Data:

Silver rationales obtained from GPT-3 using in-context learning
Preference pairs created by prompting LLMs to generate correct vs. counterfactual/irrelevant reasoning chains
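
One way to organize the preference data described above is as (question, chosen, rejected) triples. The sketch below is an assumption about data layout only: `PreferencePair` and `build_pairs` are hypothetical names, and the chains themselves would come from prompting an LLM, which is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    question: str
    chosen: str    # r_w: correct reasoning chain
    rejected: str  # r_l: counterfactual or irrelevant reasoning chain


def build_pairs(question: str, correct: str, negatives: List[str]) -> List[PreferencePair]:
    """Pair one correct chain with each usable negative chain.

    Empty negatives and negatives identical to the correct chain are skipped,
    since they carry no preference signal.
    """
    pairs = []
    for neg in negatives:
        cleaned = neg.strip()
        if cleaned and cleaned != correct.strip():
            pairs.append(PreferencePair(question, correct, cleaned))
    return pairs


pairs = build_pairs(
    "Will a puck slide farther on ice or carpet?",
    "Ice has less friction, so the puck slides farther on ice.",
    ["Ice has more friction, so the puck slides farther on carpet.",  # counterfactual
     "Pucks are usually made of rubber.",                             # irrelevant
     ""],                                                             # dropped
)
print(len(pairs))
```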

Compute: FRODO is applied to small-sized LMs (<10B parameters)

Comparison to Prior Work

vs. SFT: SFT optimizes P(y|x), while FRODO optimizes the causal chain P(y|R, x) explicitly
vs. CoT Distillation: Distillation copies tokens; FRODO adds causal preference objectives to ensure the tokens are actually used for the prediction
vs. Turpin et al.: Moves beyond measuring unfaithfulness to correcting it via causal training objectives

Limitations

Reliance on larger models (GPT-3/4) to generate silver rationales and counterfactual interventions
Analysis and improvement focused on smaller models (<10B) acting as student models
Requires careful manual curation of intervention data to ensure validity

Reproducibility

Code and data: https://debjitpaul.github.io/reasoningmatter

Silver rationales were generated using GPT-3; interventions were generated using GPT-4.

📊 Experiments & Results

Evaluation Setup

Evaluation on complex reasoning tasks measuring accuracy, robustness to counterfactuals, and OOD generalization.

Benchmarks:

Quarel (Qualitative reasoning)
StrategyQA (Multi-hop reasoning)
OpenBookQA (Commonsense reasoning)
QASC (Scientific reasoning)

Metrics:

Accuracy
Robustness (Faithfulness under intervention)
OOD Generalization (Performance on unseen datasets)
Statistical methodology: Not explicitly reported in the paper
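
The first two metrics can be sketched as simple counts. This is one plausible reading, not the paper's exact definition: robustness is taken here as the share of examples whose answer changes when the model is conditioned on a counterfactual reasoning chain. The function names are hypothetical.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that match the gold answers."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def flip_rate(original_answers: Sequence[str],
              counterfactual_answers: Sequence[str]) -> float:
    """Share of examples whose answer changes under counterfactual reasoning."""
    if len(original_answers) != len(counterfactual_answers) or not original_answers:
        raise ValueError("inputs must be non-empty and the same length")
    changed = sum(a != b for a, b in zip(original_answers, counterfactual_answers))
    return changed / len(original_answers)


# Toy data: 3 of 10 answers change after the intervention.
before = ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"]
after = ["B", "A", "B", "B", "A", "B", "A", "B", "A", "B"]
print(flip_rate(before, after))
```

A model that changed its answer on 3 of 10 counterfactual chains, as in the toy data, would match the 30% rate reported for GPT-4.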

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Average across 4 tasks (Quarel, StrategyQA, OpenBookQA, QASC)	Accuracy improvement	Not reported in the paper	Not reported in the paper	+2% to +3%
Average across tasks	Robustness improvement (faithfulness)	Not reported in the paper	Not reported in the paper	+4.5%
Out-of-distribution test sets	Accuracy improvement	Not reported in the paper	Not reported in the paper	+2.6%

Main Takeaways

Instruction-tuned models (e.g., GPT-3.5-Instruct) show stronger causal reliance on reasoning chains than RLHF models (e.g., ChatGPT), which often ignore provided reasoning.
Smaller models are systematically unfaithful in zero-shot settings but can be aligned effectively using the FRODO framework.
FRODO successfully improves the causal link between reasoning and answers, making models more robust to perturbations and better at OOD generalization compared to standard fine-tuning.
GPT-4 changes its answer only 30% of the time when given counterfactual reasoning that should lead to a different result.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Causal Inference (Mediators, Direct vs. Indirect Effects)
Direct Preference Optimization (DPO)
Supervised Fine-Tuning (SFT)

Key Terms

Causal Mediation Analysis: A statistical method used to decompose the effect of a treatment (input) on an outcome (answer) into direct effects and indirect effects mediated by an intermediate variable (reasoning chain)

Direct Effect: The influence of the input question on the final answer that is not mediated by the reasoning chain (measuring whether the model relies on shortcuts or bias)

Indirect Effect: The influence of the reasoning chain on the final answer (measuring if the model actually uses the reasoning to conclude)

DPO: Direct Preference Optimization—an algorithm for aligning language models to preferences using a binary classification loss on pairs of preferred/dispreferred outputs, avoiding a separate reward model

Counterfactual Reasoning: Reasoning chains whose premises have been deliberately modified or made false, used here to test whether the model updates its answer when the reasoning changes

SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs

FRODO: The proposed framework consisting of an Inference Module (generating CoT) and a Reasoning Module (answering based on CoT)