Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps

📝 Paper Summary

Chain of Thought (CoT) Faithfulness, Mechanistic Interpretability, Machine Unlearning

The paper introduces a framework that measures whether a model's reasoning is faithful by 'unlearning' specific reasoning steps from the model's parameters and observing whether the prediction changes.

Core Problem

Current methods for evaluating Chain of Thought (CoT) faithfulness typically rely on context perturbations (removing text), which measure self-consistency rather than whether the model's parameters actually rely on that reasoning.

Why it matters:

Models can produce plausible CoT reasoning that is actually post-hoc and disconnected from the true internal computation causing the answer
Contextual perturbations are insufficient because models might recover the 'missing' information from their internal parameters, confounding faithfulness measurements
Understanding the true causal link between reasoning and prediction is essential for building reliable and transparent AI systems

Concrete Example: If a model answers 'Team A' because it reasoned 'Player X plays for Team A', removing that reasoning text from the prompt (contextual perturbation) might not change the answer if the model already 'knows' the fact in its weights. However, if we 'unlearn' that specific fact from the weights and the answer *does* change, we know the reasoning was parametrically faithful.

Key Novelty

Parametric Faithfulness Framework (pff), instantiated via Faithfulness by Unlearning Reasoning steps (fur)

Shifts from context-based interventions (editing the prompt) to parameter-based interventions (editing the weights) to measure faithfulness
Uses machine unlearning (NPO) to surgically erase the information contained in a generated reasoning step from the model
If erasing the reasoning step from the parameters causes the model's prediction to flip, the reasoning step is deemed 'parametrically faithful'

Architecture

Conceptual diagram of the Parametric Faithfulness Framework (pff) and the specific 'fur' instance

Evaluation Highlights

Demonstrates that unlearning specific reasoning steps frequently changes the model's prediction, confirming those steps were parametrically faithful
Finding: Humans do not necessarily consider the steps identified as 'faithful' by this method to be 'plausible', indicating a gap between how models reason and how humans expect them to
Finding: Unlearning a faithful step often leads the model to generate a new CoT supporting a completely different answer, hinting at a deep causal effect

Breakthrough Assessment

7/10

A novel methodological shift from input-level to parameter-level analysis for CoT faithfulness. While the 'unlearning' technique is standard, applying it to verify reasoning causality is a clever and stronger test than existing ablation methods.

⚙️ Technical Details

Problem Definition

Setting: Measuring the causal dependence of a model's prediction y on a generated reasoning step r_i

Inputs: A Language Model M, a question q, and a generated Chain of Thought (CoT)

Outputs: Faithfulness scores (ff-hard, ff-soft) indicating if the reasoning causally drives the prediction

Pipeline Flow

Generation: Model M generates CoT and answer
Segmentation: CoT is split into individual reasoning steps
Intervention: Unlearn a specific step r_i from M to create M*
Evaluation: Compare predictions of M and M* to calculate faithfulness
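The four stages above can be sketched as a small driver loop. This is a minimal illustration, not the authors' code: `segment_cot`, `run_pipeline`, `StepResult`, and the three callables (`generate`, `unlearn`, `predict`) are hypothetical names, and splitting the CoT on line breaks is an assumed segmentation rule.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class StepResult:
    step: str
    original_answer: str
    new_answer: str

    @property
    def flipped(self) -> bool:
        # The step counts as parametrically faithful (hard sense) if the answer changed.
        return self.original_answer != self.new_answer


def segment_cot(cot: str) -> List[str]:
    # Assumption: one reasoning step per non-empty line.
    return [line.strip() for line in cot.splitlines() if line.strip()]


def run_pipeline(
    question: str,
    generate: Callable[[str], Tuple[str, str]],  # question -> (CoT, answer) from M
    unlearn: Callable[[str], Any],               # step -> handle to unlearned model M*
    predict: Callable[[Any, str], str],          # (M*, question) -> answer
) -> List[StepResult]:
    cot, answer = generate(question)
    results = []
    for step in segment_cot(cot):
        unlearned_model = unlearn(step)          # each step is unlearned from a fresh copy of M
        results.append(StepResult(step, answer, predict(unlearned_model, question)))
    return results


# Toy stand-ins so the sketch runs without a real language model.
def toy_generate(question: str) -> Tuple[str, str]:
    return "Player X plays for Team A.\nTeam A is based in City B.", "A"


def toy_unlearn(step: str) -> str:
    return step  # the "model handle" is simply the erased step


def toy_predict(model: str, question: str) -> str:
    return "B" if "Player X" in model else "A"


results = run_pipeline("Which team?", toy_generate, toy_unlearn, toy_predict)
for r in results:
    print(r.step, "->", "flipped" if r.flipped else "unchanged")
```

In a real setting, `unlearn` would run the NPO+KL update on a copy of M and `predict` would run inference on the resulting M*.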

System Modules

CoT Generator

Generate the initial reasoning chain and prediction

Model or implementation: Target LM (e.g., Llama variants)

Unlearning Module

Erase the knowledge of a specific reasoning step from parameters

Model or implementation: NPO+KL Optimizer

Faithfulness Evaluator

Quantify impact of unlearning on prediction

Model or implementation: Inference on M*

Novel Architectural Elements

Integration of Machine Unlearning (NPO) as an interpretability probe within a faithfulness evaluation pipeline

Modeling

Base Model: Four LMs. The summarized text does not name them; the linked repository suggests Llama-3-8B-Instruct may be among them.

Training Method: Negative Preference Optimization (NPO) with KL regularization

Objective Functions:

Purpose: Discourage the model from predicting the tokens of the target reasoning step given its prefix.

Formally: NPO loss on Forget Set (D_FG)

Purpose: Maintain model fluency and general capabilities.

Formally: KL divergence between original M and unlearned M* on Retain Set (D_RT)
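For reference, the two terms can be written out as follows. The NPO term follows the standard formulation of Negative Preference Optimization; the direction of the KL term and the weighting coefficient \lambda are assumptions made here for illustration, since the summary does not specify them. Here \pi_{\mathrm{ref}} denotes the original model M, \pi_\theta the model being unlearned (M*), and \beta the NPO temperature.

```latex
\mathcal{L}_{\mathrm{NPO}}(\theta)
  = -\frac{2}{\beta}\,
    \mathbb{E}_{(x,y)\sim D_{FG}}
    \left[ \log \sigma\!\left( -\beta \log
      \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right) \right]

\mathcal{L}_{\mathrm{KL}}(\theta)
  = \mathbb{E}_{x\sim D_{RT}}
    \left[ \mathrm{KL}\!\left( \pi_{\mathrm{ref}}(\cdot \mid x)
      \,\middle\|\, \pi_\theta(\cdot \mid x) \right) \right]

\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{NPO}}(\theta) + \lambda\, \mathcal{L}_{\mathrm{KL}}(\theta)
```

Minimizing the NPO term pushes \pi_\theta(y \mid x) below \pi_{\mathrm{ref}}(y \mid x) on forget pairs, while the KL term keeps the output distribution close to the original model on retain inputs.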

Trainable Parameters: The second feed-forward matrix (FF2) of the Transformer MLP blocks, viewed as the model's memory store
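Restricting updates to FF2 amounts to filtering parameters by name. The sketch below assumes Hugging Face Llama-style naming, where the second MLP matrix is called `mlp.down_proj`; other architectures use different names, and selecting every layer (rather than a subset) is also an assumption. `select_ff2_params` is a hypothetical helper.

```python
from typing import Iterable, List


def select_ff2_params(
    param_names: Iterable[str],
    suffix: str = "mlp.down_proj.weight",  # assumed Llama-style name of the FF2 matrix
) -> List[str]:
    # Keep only the second feed-forward matrix of each MLP block.
    return [name for name in param_names if name.endswith(suffix)]


# Example parameter names in the Llama-style layout.
example_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.1.mlp.down_proj.weight",
    "lm_head.weight",
]

trainable = select_ff2_params(example_names)
print(trainable)
```

With a PyTorch model, the same filter could be applied over `model.named_parameters()`, setting `requires_grad` to `True` only for the selected names.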

Training Data:

Forget Set: Input-output pairs of the target reasoning step (prefix -> content word)
Retain Set: Content words from 4 randomly selected CoT steps from other instances
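A rough sketch of how (prefix -> content word) forget pairs could be built from one reasoning step. It is an illustration under simplifying assumptions: whitespace tokenization instead of model tokens, and a tiny hand-written function-word list instead of a proper content-word filter. `FUNCTION_WORDS` and `build_forget_pairs` are hypothetical names, and the example text is invented.

```python
from typing import List, Tuple

# Assumption: a small stop list stands in for a real content-word filter.
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "was", "of", "for",
                  "in", "on", "to", "and", "so", "that"}
PUNCTUATION = ".,;:!?\"'"


def build_forget_pairs(context: str, step: str) -> List[Tuple[str, str]]:
    """Return (prefix, content_word) pairs for every content word in `step`.

    The prefix is the context followed by the words of the step that
    precede the target word.
    """
    words = step.split()
    pairs = []
    for i, word in enumerate(words):
        target = word.strip(PUNCTUATION)
        if not target or target.lower() in FUNCTION_WORDS:
            continue  # only content words become unlearning targets
        prefix = " ".join([context] + words[:i]).strip()
        pairs.append((prefix, target))
    return pairs


example_context = "Q: Which team does Kim play for? A:"
example_step = "Kim plays for the Falcons."
example_pairs = build_forget_pairs(example_context, example_step)
for prefix, target in example_pairs:
    print(repr(prefix), "->", repr(target))
```

The same routine, applied to steps drawn from other instances, would give retain-set targets on which the KL term is computed.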

Key Hyperparameters:

unlearning_iterations: 55
retain_batch_size: 4 random instances

Compute: Not reported in the paper

Comparison to Prior Work

vs. Contextual Perturbation: 'fur' removes information from parameters, not just context, preventing the model from recovering knowledge from weights
vs. Activation Patching: 'fur' makes permanent weight changes to assess global model behavior change, rather than transient activation noise
vs. ROME/MEMIT: 'fur' uses NPO, which handles unstructured text and reasoning better than ROME/MEMIT, which require structured (subject, relation, object) tuples

Limitations

Unlearning might not be perfectly precise; specificity controls are needed to ensure other knowledge isn't damaged
ff-hard is a lower bound; a step might be faithful but redundant (multiple paths to answer), so unlearning it doesn't flip the answer
Evaluation relies on the assumption that NPO successfully erases the target concept without side effects

Reproducibility

Code: https://github.com/technion-cs-nlp/parametric-faithfulness

Code is publicly available at the repository linked above. The NPO hyperparameters are stated (55 iterations, updating the FF2 matrices). The summarized text refers to the models and datasets only as 'four LMs' and 'five MCQA datasets', without naming them all.

📊 Experiments & Results

Evaluation Setup

Multi-hop Multi-choice Question Answering (MCQA)

Benchmarks:

Five multi-hop reasoning MCQA datasets, including Sports

Metrics:

Efficacy (reduction in probability of unlearned step)
Specificity (accuracy on held-out in-domain instances)
ff-hard (percentage of answer flips after unlearning)
ff-soft (probability mass shift)
MMLU (zero-shot, to check general capability retention)
Statistical methodology: Not explicitly reported in the paper
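The two faithfulness metrics can be illustrated as follows. This is a sketch based only on the descriptions above: ff-hard as the percentage of answer flips, and ff-soft as the probability mass that leaves the originally predicted option. The paper's exact definitions (for example, how scores are aggregated over steps) may differ, and the function names and example numbers are hypothetical.

```python
from typing import Dict, List


def ff_hard(original_answers: List[str], unlearned_answers: List[str]) -> float:
    """Percentage of cases where the answer changes after unlearning."""
    if len(original_answers) != len(unlearned_answers):
        raise ValueError("answer lists must have the same length")
    if not original_answers:
        raise ValueError("need at least one prediction")
    flips = sum(o != u for o, u in zip(original_answers, unlearned_answers))
    return 100.0 * flips / len(original_answers)


def ff_soft(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Probability mass shifted away from the originally predicted option.

    `before` and `after` map answer options to probabilities under M and M*.
    """
    original_choice = max(before, key=before.get)
    return before[original_choice] - after[original_choice]


# Invented numbers for illustration only.
hard_score = ff_hard(["A", "B", "C", "A"], ["B", "B", "C", "C"])
soft_score = ff_soft(
    {"A": 0.7, "B": 0.2, "C": 0.1},
    {"A": 0.3, "B": 0.6, "C": 0.1},
)
print(f"ff-hard: {hard_score:.1f}%  ff-soft: {soft_score:.2f}")
```

A positive ff-soft without a flip is the case that ff-hard misses: the step influenced the prediction, but not enough to change the chosen option.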

Main Takeaways

Unlearning reasoning steps via 'fur' is feasible: it reduces the probability of the target step (efficacy) while largely maintaining performance on held-out data (specificity)
Parametric faithfulness exists: Unlearning specific reasoning steps frequently causes the model to change its answer, indicating a causal link between the parameters encoding that reasoning and the prediction
Reasoning plasticity: After unlearning a faithful step, the model often generates a new CoT that supports the *new* (different) answer, suggesting the intervention fundamentally alters the reasoning path
Plausibility gap: Reasoning steps identified as parametrically faithful by 'fur' are not always rated as plausible by humans, highlighting a misalignment between human intuition and model mechanism

📚 Prerequisite Knowledge

Prerequisites

Chain of Thought (CoT) prompting
Machine Unlearning concepts (Forget set vs. Retain set)
Language Model fine-tuning (specifically MLP layers)

Key Terms

pff: Parametric Faithfulness Framework—the general methodology of intervening on model parameters to test reasoning faithfulness

fur: Faithfulness by Unlearning Reasoning steps—the specific instance of pff using NPO to unlearn steps

NPO: Negative Preference Optimization—an unlearning loss function that discourages the model from generating specific 'forget' sequences

CoT: Chain of Thought—intermediate reasoning steps generated by a model before its final answer

Parametric Faithfulness: Whether the reasoning chain accurately reflects the internal computations (parameters) used to derive the answer

Contextual Faithfulness: Whether the model's answer is consistent with the provided reasoning context (measured by editing the prompt)

MCQA: Multi-choice Question Answering—the task format used for evaluation

ff-hard: A binary metric indicating whether unlearning a reasoning step causes the model's answer to flip; reported in aggregate as the percentage of answer flips

ff-soft: A continuous metric measuring the probability mass shift away from the original answer after unlearning