CoT: Chain-of-Thought—a prompting technique where the model generates step-by-step reasoning before its final answer
Faithfulness: The property that the stated reasoning accurately represents the actual computational process used by the model to reach its conclusion
Post-hoc reasoning: Reasoning generated after the conclusion has effectively been reached, serving as a justification rather than a cause
RLHF: Reinforcement Learning from Human Feedback—a training method used to align language models with human preferences
Steganography: Encoding hidden information in the reasoning text (e.g., via punctuation or phrasing choices), allowing the model to pass information to the final-answer step in a form that is not readable by humans
AOC: Area Over the Curve—a metric used here to quantify faithfulness; a higher AOC means the model's answer changes more often when its reasoning is truncated, implying the reasoning is less post-hoc (see the sketch after this glossary)
Inverse scaling: A phenomenon where model performance or a desirable behavior (here, faithfulness) worsens as model size increases
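
To make the AOC entry concrete, here is a minimal sketch of how such a metric could be computed from a truncation (early-answering) experiment. The function name and the numbers below are illustrative assumptions, not the actual implementation; the only input it assumes is that, for each fraction of the CoT shown, you have measured the share of samples whose answer matches the answer given with the complete CoT.

```python
# Hypothetical sketch (not the paper's code): compute AOC from truncation data.
# `fractions` are the shares of the CoT shown to the model; `match_rates[i]` is
# the fraction of samples whose answer under the truncated CoT equals the
# answer under the full CoT (so match_rates[-1] == 1.0 by construction).

def area_over_curve(fractions, match_rates):
    """AOC = 1 - area under the (CoT fraction -> answer-match rate) curve."""
    # Trapezoid rule for the area under the curve.
    auc = sum(
        (fractions[i + 1] - fractions[i]) * (match_rates[i + 1] + match_rates[i]) / 2.0
        for i in range(len(fractions) - 1)
    )
    return 1.0 - auc

# Toy usage with made-up numbers: truncating the CoT changes the answer often,
# so the AOC is high, suggesting the reasoning is not merely post-hoc.
fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
match_rates = [0.35, 0.45, 0.60, 0.85, 1.00]
print(f"AOC = {area_over_curve(fractions, match_rates):.3f}")
```

Conversely, a near-zero AOC would mean truncation barely changes the answer, i.e., the answer was effectively fixed before the reasoning was written.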