Faithful Chain-of-Thought Reasoning

📝 Paper Summary

Chain-of-Thought Prompting, Neuro-symbolic Reasoning, Interpretability

Faithful CoT decouples reasoning into a translation stage (LLM generates a symbolic plan) and a problem-solving stage (deterministic solver executes the plan), guaranteeing the explanation causes the answer.

Core Problem

Standard Chain-of-Thought (CoT) prompting is not guaranteed to be faithful: the generated reasoning text does not necessarily cause the final answer. A model can produce plausible reasoning yet output an answer that does not follow from it, or reach a correct answer that its stated reasoning does not support.

Why it matters:

Unfaithful explanations in high-stakes domains (e.g., legal, medical) mislead users into over-trusting the model based on plausible-looking but causally disconnected reasoning
Standard CoT provides no guarantee that the model's stated logic is the actual mechanism producing the prediction
Existing methods often conflate interpretability (faithfulness) with plausibility (convincingness), failing to provide true transparency

Concrete Example: In a math problem about buying video games (Figure 1), standard CoT correctly calculates the intermediate value '$195' but then gives '0' as the final answer, contradicting its own reasoning chain. The stated reasoning, although plausible, is therefore not what produced the output.

Key Novelty

Two-Stage Neuro-symbolic Decoupling

Decomposes the reasoning process: The LLM acts solely as a translator from natural language queries to symbolic code (Python, Datalog, PDDL) interleaved with comments
Delegates execution: The final answer is derived strictly by running the generated code with a deterministic external solver, ensuring the reasoning chain is the true cause of the answer
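To make the two bullets above concrete, here is a minimal sketch of what a reasoning chain could look like when Python is the symbolic language. The word problem, the variable names, and the convention that the solver reads a variable called `answer` are illustrative assumptions, not the paper's actual prompt format.

```python
# Hypothetical output of the translation stage: Python code interleaved with
# natural-language comments. In the real system an LLM would write this text.
reasoning_chain = '''
# Q: Each video game costs $15. Ana buys 13 games. How much does she pay?
# 1. How much does one game cost? (given in the question)
price_per_game = 15
# 2. How many games does she buy? (given in the question)
num_games = 13
# 3. What is the total cost? (depends on 1 and 2)
answer = price_per_game * num_games
'''


def solve(chain: str):
    """Problem-solving stage: execute the chain and read back `answer`."""
    namespace = {}
    exec(chain, namespace)  # deterministic execution; no LLM is involved here
    return namespace["answer"]


print(solve(reasoning_chain))  # 195
```

Because the answer is read from the executed program, it cannot disagree with the steps that computed it. In practice, model-written code should be run in a sandbox rather than with a bare `exec`.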

Architecture

Overview of the Faithful CoT framework pipeline

Evaluation Highlights

+21.7 percentage points accuracy on Date Understanding (Multi-hop QA) with code-davinci-002 and greedy decoding, compared to standard CoT (59.9 → 81.6)
+9.0 percentage points (14.2% relative) on Math Word Problems (GSM8K) compared to standard CoT (63.3 → 72.3), showing that enforcing faithfulness can also improve correctness
58.9% accuracy on Relational Inference (CLUTRR) with greedy decoding versus 48.5% for standard CoT; with self-consistency decoding the gap grows to 71.9% versus 45.7%
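The highlights mix absolute gains (percentage points) and relative gains (percent of the baseline). The short sketch below shows how the two relate, using the GSM8K greedy-decoding accuracies from the Key Results table in this summary; the helper name `gains` is made up for illustration.

```python
from typing import Tuple


def gains(baseline: float, new: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# GSM8K accuracies with greedy decoding, as listed in this summary.
abs_gain, rel_gain = gains(63.3, 72.3)
print(f"{abs_gain:.1f} points, {rel_gain:.1f}% relative")  # 9.0 points, 14.2% relative
```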

Breakthrough Assessment

8/10

Strongly addresses the critical interpretability flaw of CoT (unfaithfulness) while simultaneously achieving SOTA results across diverse domains (Math, Logic, Planning) via neuro-symbolic integration.

⚙️ Technical Details

Problem Definition

Setting: Few-shot prompting task where a query Q must be mapped to an answer A via a reasoning chain C

Inputs: Natural Language Query Q

Outputs: Reasoning Chain C (interleaved NL and Symbolic Language) and Final Answer A

Pipeline Flow

Translation Stage: LLM translates Query (NL) -> Reasoning Chain (NL + SL)
Problem Solving Stage: External Solver executes Chain (SL) -> Answer
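The two stages can be sketched as a small pipeline. This is only an illustration under stated assumptions: `ReasoningChain`, `SOLVERS`, `faithful_cot`, and `fake_translate` are hypothetical names, the translator is a hard-coded stand-in for a few-shot prompted LLM, and only a Python solver is registered.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ReasoningChain:
    language: str  # which symbolic language the chain is written in
    text: str      # NL comments interleaved with SL code


def run_python(chain_text: str) -> Any:
    """Execute a Python chain and return the value bound to `answer`."""
    namespace: Dict[str, Any] = {}
    exec(chain_text, namespace)
    return namespace["answer"]


# One deterministic solver per symbolic language (only Python in this sketch).
SOLVERS: Dict[str, Callable[[str], Any]] = {"python": run_python}


def faithful_cot(query: str, translate: Callable[[str], ReasoningChain]) -> Any:
    chain = translate(query)          # Translation Stage: NL -> NL + SL
    solver = SOLVERS[chain.language]  # pick the solver for the chain's language
    return solver(chain.text)         # Problem Solving Stage: SL -> answer


def fake_translate(query: str) -> ReasoningChain:
    # Stand-in for the LLM; a real translator would condition on the query.
    return ReasoningChain("python", "# How many days are in 3 weeks?\nanswer = 3 * 7\n")


result = faithful_cot("How many days are there in 3 weeks?", fake_translate)
print(result)  # 21
```

Keeping the translator behind a plain callable means the LLM can be swapped without touching the solver side, which is where the answer is actually produced.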

System Modules

Translator

Translate natural language query into a structured reasoning chain containing sub-questions (NL), dependencies, and symbolic code (SL)

Model or implementation: Codex (code-davinci-002) or GPT-4

Deterministic Solver

Execute the symbolic components of the reasoning chain to derive the final answer

Model or implementation: Python Interpreter / Datalog Executor / PDDL Planner

Novel Architectural Elements

Interleaved NL/SL Generation: The prompt structure forces the model to generate Natural Language decomposition (for human readability) alongside Symbolic Language (for execution)
Solver-in-the-loop Inference: The final answer is never generated by the LLM directly; it is strictly the output of the external deterministic engine
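As a rough illustration of these two elements, the sketch below separates a chain into its natural-language comments (for the reader) and its symbolic lines (for the solver), then computes the answer from the symbolic lines alone. The chain text, the `split_chain` helper, and the "comment lines start with `#`" convention are assumptions made for this example.

```python
def split_chain(chain: str):
    """Separate NL comment lines from executable SL lines.

    Only full-line comments are treated as NL; inline comments stay with the code.
    """
    nl, sl = [], []
    for line in chain.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            nl.append(stripped.lstrip("# ").rstrip())
        else:
            sl.append(line)
    return nl, sl


chain = """\
# 1. How many apples does Tom start with?
apples_start = 8
# 2. How many apples does he give away?
apples_given = 3
# 3. How many apples are left? (depends on 1 and 2)
answer = apples_start - apples_given
"""

nl_steps, sl_lines = split_chain(chain)
namespace = {}
exec("\n".join(sl_lines), namespace)  # the answer comes only from the SL lines

print(nl_steps)
print(namespace["answer"])  # 5
```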

Modeling

Base Model: Codex (code-davinci-002)

Reproducibility

Code, data, and prompts: https://github.com/veronica320/Faithful-COT

Note: code-davinci-002 was discontinued by OpenAI in March 2023, which affects exact reproducibility.

📊 Experiments & Results

Evaluation Setup

Few-shot prompting (6-10 shots) on 10 datasets across 4 domains

Benchmarks:

GSM8K (Math Word Problems)
StrategyQA (Multi-hop Question Answering)
SayCan (Planning)
CLUTRR (Relational Inference)
Date Understanding (Multi-hop Question Answering, from BIG-bench)

Metrics:

Accuracy (Exact Match)
Human-rated Plausibility
Statistical methodology: Not explicitly reported in the paper
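Exact-match accuracy can be sketched as follows. The normalization step (string conversion, trimming, lowercasing) is an assumption made for illustration; the paper's evaluation scripts may normalize answers differently.

```python
from typing import Any, Sequence


def exact_match_accuracy(predictions: Sequence[Any], references: Sequence[Any]) -> float:
    """Percentage of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(value: Any) -> str:
        # Assumed normalization: compare as trimmed, lowercased strings.
        return str(value).strip().lower()

    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


acc = exact_match_accuracy(["195", "Yes", "7"], ["195", "no", "7"])
print(f"{acc:.1f}")  # 66.7
```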

Key Results

Greedy decoding comparisons show Faithful CoT outperforming Standard CoT on most reasoning tasks, particularly those requiring strict logic or calculation.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	63.3	72.3	+9.0
Date Understanding	Accuracy	59.9	81.6	+21.7
CLUTRR	Accuracy	48.5	58.9	+10.4
SayCan	Accuracy	86.4	89.3	+2.9

Self-Consistency decoding (voting over multiple samples) further widens the gap in logic-heavy tasks like relational inference.

Benchmark	Metric	Baseline	This Paper	Δ
CLUTRR	Accuracy	45.7	71.9	+26.2
GSM8K	Accuracy	72.3	21.5	-50.8

Experiment Figures

Ablation study results showing accuracy when removing different parts of the prompt (Rationales, NL, Solver)

Main Takeaways

Faithful CoT outperforms standard CoT and Least-to-Most prompting on 9 out of 10 datasets, particularly on tasks requiring multi-step computation or strict logic
The external solver is crucial; ablation studies show performance collapses on Math and Logic tasks when the solver is removed and the LLM attempts to predict the answer directly
Natural Language comments in the reasoning chain are essential for performance on complex relational tasks (CLUTRR) but less critical for Math (GSM8K), though they always serve the purpose of interpretability
Human evaluation confirms that correct answers largely correlate with correct reasoning chains (90%+ plausibility on most datasets), though exceptions exist where the model gets the right answer with flawed logic

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting paradigm
In-context learning / Few-shot prompting
Basic familiarity with symbolic languages (Python, Datalog, PDDL)

Key Terms

Faithfulness: In interpretability, the property that an explanation accurately represents the true reasoning process behind a model's prediction

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

Symbolic Language (SL): Formal languages like Python, Datalog, or PDDL used to represent logic and computation deterministically

PDDL: Planning Domain Definition Language—a standard encoding for classical planning problems involving objects, actions, and goals
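To give a feel for what a planner does with such an encoding, here is a minimal Python sketch of a planning problem in the spirit of PDDL: states are sets of facts, and each action has preconditions, add effects, and delete effects. It is not PDDL syntax and not the planner used in the paper; the `Action` type, the `plan` function, and the cup domain are invented for illustration.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]


def plan(initial: FrozenSet[str], goal: FrozenSet[str],
         actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first search over states; returns a shortest plan or None."""
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:  # every goal fact holds in this state
            return steps
        for action in actions:
            if action.preconditions <= state:
                nxt = (state - action.delete_effects) | action.add_effects
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [action.name]))
    return None


actions = [
    Action("pick-up-cup",
           frozenset({"hand-empty", "cup-on-counter"}),
           frozenset({"holding-cup"}),
           frozenset({"hand-empty", "cup-on-counter"})),
    Action("put-cup-on-table",
           frozenset({"holding-cup"}),
           frozenset({"cup-on-table", "hand-empty"}),
           frozenset({"holding-cup"})),
]
initial = frozenset({"hand-empty", "cup-on-counter"})
goal = frozenset({"cup-on-table"})

print(plan(initial, goal, actions))  # ['pick-up-cup', 'put-cup-on-table']
```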

Datalog: A declarative logic programming language used for querying databases and deductive reasoning
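The deductive flavor of Datalog can be imitated in a few lines of Python. The sketch below hard-codes one pair of rules (shown in the docstring) and evaluates them bottom-up until no new facts appear; it is a toy under stated assumptions, not a Datalog engine, and the family facts are invented.

```python
from typing import Set, Tuple

Fact = Tuple[str, str]


def ancestors(parent_facts: Set[Fact]) -> Set[Fact]:
    """Naive bottom-up evaluation of the rules:

        ancestor(X, Y) :- parent(X, Y).
        ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    """
    ancestor = set(parent_facts)  # first rule: every parent is an ancestor
    while True:
        # second rule: join parent(X, Y) with ancestor(Y, Z)
        derived = {(x, z)
                   for (x, y) in parent_facts
                   for (y2, z) in ancestor
                   if y == y2}
        if derived <= ancestor:  # fixpoint reached: nothing new was derived
            return ancestor
        ancestor |= derived


parent = {("ann", "bob"), ("bob", "cid"), ("cid", "dee")}
result = ancestors(parent)
print(sorted(result))
```

The same fixpoint idea underlies relational inference tasks, where a target relation has to be derived by chaining several stated relations.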

Greedy decoding: A generation strategy where the model selects the highest-probability token at each step (equivalent to sampling at temperature 0)
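A minimal sketch of greedy decoding, assuming a toy "model" that is just a lookup table from the last token to next-token scores. The table, the token names, and the `<eos>` stop marker are invented for illustration; a real LLM scores its whole vocabulary at each step.

```python
from typing import Callable, Dict, List


def greedy_decode(next_token_scores: Callable[[List[str]], Dict[str, float]],
                  prompt: List[str], eos: str = "<eos>",
                  max_steps: int = 20) -> List[str]:
    """Repeatedly append the single highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(max_steps):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # argmax over candidate tokens
        if best == eos:
            break
        tokens.append(best)
    return tokens[len(prompt):]  # only the newly generated tokens


# Toy next-token distribution that depends only on the last token.
TABLE = {
    "answer": {"=": 0.9, "is": 0.1},
    "=": {"3": 0.6, "4": 0.4},
    "3": {"*": 0.7, "<eos>": 0.3},
    "*": {"7": 0.8, "5": 0.2},
    "7": {"<eos>": 0.9, "+": 0.1},
}


def toy_model(tokens: List[str]) -> Dict[str, float]:
    return TABLE[tokens[-1]]


print(greedy_decode(toy_model, ["answer"]))  # ['=', '3', '*', '7']
```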

Self-consistency: An ensemble strategy where the model generates multiple reasoning paths and selects the final answer via majority vote
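A sketch of the voting step, assuming each sampled reasoning chain has already been executed to an answer. Representing failed executions as `None` and skipping them is an assumption of this example, as is the `majority_vote` name.

```python
from collections import Counter
from typing import Any, Optional, Sequence


def majority_vote(answers: Sequence[Optional[Any]]) -> Optional[Any]:
    """Return the most frequent answer, ignoring failed executions (None)."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Answers from five sampled chains; one chain failed to execute.
sampled_answers = [195, 195, 180, None, 195]
print(majority_vote(sampled_answers))  # 195
```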

SOTA: State-of-the-Art—the current best performance on a specific benchmark