To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

📝 Paper Summary

Prompt Engineering Reasoning Evaluation

A comprehensive meta-analysis and empirical study demonstrate that Chain-of-Thought prompting provides significant gains primarily on math and symbolic tasks, yielding negligible benefits over direct answering for commonsense or soft reasoning.

Core Problem

Chain-of-Thought (CoT) is widely applied as a default prompting strategy for all reasoning tasks, increasing inference costs without proven benefits outside of mathematical domains.

Why it matters:

Current LLM deployments (e.g., ChatGPT) often default to CoT, potentially wasting computation on tasks where direct answering is equally effective
Research benchmarks are heavily skewed toward math (GSM8K, MATH), creating a false perception that CoT is universally beneficial for all 'reasoning'
Blindly applying CoT to non-symbolic tasks (like commonsense QA) can sometimes hurt performance or introduce unnecessary latency

Concrete Example: On the MMLU benchmark, CoT provides almost no benefit for generic questions, but for questions containing an equals sign ('=')—indicating symbolic operations—accuracy improves significantly. Directly answering non-math questions yields nearly identical accuracy to CoT.
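
The equals-sign split described above can be sketched as follows. This is a minimal illustration, not the paper's analysis code: the `Record` type, the function names, and the four toy records are assumptions made up for the example. Only the idea of partitioning questions on whether they contain '=' comes from the summary.

```python
from dataclasses import dataclass


@dataclass
class Record:
    question: str
    direct_correct: bool  # direct-answer prediction matched the gold label
    cot_correct: bool     # CoT prediction matched the gold label


def accuracy(flags):
    """Accuracy in percentage points; NaN for an empty subset."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else float("nan")


def cot_delta_by_equals(records):
    """Return CoT accuracy minus direct accuracy for each subset."""
    subsets = {
        "with_equals": [r for r in records if "=" in r.question],
        "without_equals": [r for r in records if "=" not in r.question],
    }
    return {
        name: accuracy(r.cot_correct for r in subset)
        - accuracy(r.direct_correct for r in subset)
        for name, subset in subsets.items()
    }


# Toy records (hypothetical, not taken from MMLU).
toy_records = [
    Record("If 2x + 3 = 11, what is x?", direct_correct=False, cot_correct=True),
    Record("Solve y = 3 * 4.", direct_correct=True, cot_correct=True),
    Record("Which organ pumps blood?", direct_correct=True, cot_correct=True),
    Record("Who wrote Hamlet?", direct_correct=False, cot_correct=False),
]

deltas = cot_delta_by_equals(toy_records)
print(deltas)  # {'with_equals': 50.0, 'without_equals': 0.0}
```

On real data the same split would be applied to per-question correctness flags from both prompting modes.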

Key Novelty

Large-Scale CoT Utility Delimitation

Conducts a meta-analysis of over 100 papers and new experiments across 20 datasets to rigorously classify where CoT helps versus where it is redundant
Identifies 'Symbolic Execution' as the primary mechanism for CoT gains, showing that benefits correlate strongly with the presence of formal systems (logic, math) rather than general reasoning
Demonstrates that for symbolic tasks, tool-augmented LLMs (offloading to solvers) outperform CoT, suggesting CoT is a suboptimal middle ground

Architecture

Comparison of CoT utility across different domains (Symbolic vs. Soft Reasoning)

Evaluation Highlights

Meta-analysis of the literature shows large average gains for CoT over direct answering on Symbolic (+14.2 points) and Math (+12.3 points) tasks
In contrast, 'Other' task categories in the literature show a negligible improvement: CoT averages 56.8 versus 56.1 for direct answering (+0.7 points)
95% of the total performance gain from CoT on the MMLU benchmark is attributed specifically to questions containing an equals sign ('=')
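
One way to compute a "share of total gain" figure like the one above is to count net corrected questions rather than compare subset accuracies, since subsets of different sizes are normalized differently. The sketch below assumes that interpretation; the function name and the toy tuples are hypothetical, and the toy counts are deliberately chosen not to reproduce the paper's numbers.

```python
def gain_share(records):
    """Fraction of the net CoT gain that comes from questions containing '='.

    records: iterable of (has_equals, direct_correct, cot_correct) tuples.
    """
    gain_eq = 0
    gain_other = 0
    for has_equals, direct_correct, cot_correct in records:
        # +1 if CoT fixes the question, -1 if CoT breaks it, 0 otherwise.
        net = int(cot_correct) - int(direct_correct)
        if has_equals:
            gain_eq += net
        else:
            gain_other += net
    total = gain_eq + gain_other
    if total <= 0:
        raise ValueError("no net CoT gain to attribute")
    return gain_eq / total


# Toy data (hypothetical): 3 net fixes with '=', 2 fixes and 1 regression without.
toy = (
    [(True, False, True)] * 3
    + [(False, False, True)] * 2
    + [(False, True, False)] * 1
    + [(False, True, True)] * 5
)
share = gain_share(toy)
print(share)  # 0.75
```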

Breakthrough Assessment

8/10

Provides a crucial corrective to the field's assumption that CoT is a universal reasoning enhancer. The distinction between symbolic execution and soft reasoning is empirically validated and impactful for efficient deployment.

⚙️ Technical Details

Problem Definition

Setting: Evaluating the conditional probability of correct answer generation $P(a|q)$ under different prompting strategies

Inputs: Natural language question q

Outputs: Answer a (either directly generated or extracted from a chain of thought y)

Pipeline Flow

Prompt Formulation (CoT vs Direct)
Inference (vLLM)
Answer Extraction
Evaluation
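
The four stages above compose naturally into a single function. The sketch below is an assumption about how such a harness could be wired, not the authors' code: `evaluate`, `fake_generate`, and the inline prompt and extraction lambdas are hypothetical stand-ins, and the inference stage is a stub so the example runs without a model or vLLM.

```python
from typing import Callable, Iterable, List, Tuple


def evaluate(
    examples: Iterable[Tuple[str, str]],
    build_prompt: Callable[[str], str],
    generate: Callable[[List[str]], List[str]],
    extract: Callable[[str], str],
) -> float:
    """Run the pipeline on (question, gold_answer) pairs and return accuracy in [0, 1]."""
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to evaluate")
    prompts = [build_prompt(q) for q, _ in examples]   # 1. prompt formulation
    outputs = generate(prompts)                        # 2. inference
    preds = [extract(text) for text in outputs]        # 3. answer extraction
    correct = sum(p == gold for p, (_, gold) in zip(preds, examples))
    return correct / len(examples)                     # 4. evaluation


def fake_generate(prompts: List[str]) -> List[str]:
    """Stub standing in for a real inference engine."""
    return ["Answer: 4" if "2 + 2" in p else "Answer: unknown" for p in prompts]


examples = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
acc = evaluate(
    examples,
    build_prompt=lambda q: f"Question: {q}\nAnswer immediately.",
    generate=fake_generate,
    extract=lambda text: text.split("Answer:")[-1].strip(),
)
print(acc)  # 0.5
```

Running the same function once with a CoT prompt builder and once with a direct one yields the two accuracies that the study compares.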

System Modules

Prompt Formulation

Wrap input questions with specific instructions (e.g., 'Think step by step' for CoT or 'immediately generate' for Direct)

Model or implementation: N/A (Prompt Template)
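
A prompt template for the two modes might look like the sketch below. The exact instruction wording, the 'Answer: <letter>' output convention, and the `build_prompt` helper are assumptions for illustration; the summary only indicates that CoT prompts ask the model to think step by step and direct prompts ask it to answer immediately.

```python
import string

# Hypothetical instruction strings; the paper's exact wording may differ.
INSTRUCTIONS = {
    "cot": "Think step by step, then finish with 'Answer: <letter>'.",
    "direct": "Immediately generate the final answer as 'Answer: <letter>' with no explanation.",
}


def build_prompt(question, choices, mode):
    """Wrap a multiple-choice question with a CoT or direct-answer instruction."""
    if mode not in INSTRUCTIONS:
        raise ValueError(f"mode must be one of {sorted(INSTRUCTIONS)}")
    lines = [f"Question: {question}"]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"({letter}) {choice}")
    lines.append(INSTRUCTIONS[mode])
    return "\n".join(lines)


cot_prompt = build_prompt("What is 2 + 2?", ["3", "4", "5"], mode="cot")
direct_prompt = build_prompt("What is 2 + 2?", ["3", "4", "5"], mode="direct")
print(cot_prompt)
```

Keeping the question and choices identical across modes means any accuracy difference can be attributed to the instruction alone.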

Inference Engine

Generate model responses using greedy decoding

Model or implementation: 14 contemporary LLMs (including Llama 3.1, Mistral)

Answer Extractor

Parse the final answer from the generated text

Model or implementation: Rule-based parsers
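
A rule-based parser for multiple-choice outputs can be as small as one regular expression. The sketch below is an assumed example, not one of the paper's parsers (which are tailored per model and dataset): the pattern, the A to E choice range, and the `extract_choice` name are all hypothetical.

```python
import re
from typing import Optional

# "answer" and "is" match case-insensitively; the choice letter must be an
# uppercase A-E that is not the first letter of a longer word.
_ANSWER_RE = re.compile(
    r"(?i:answer)\s*(?i:is)?\s*:?\s*\(?([A-E])\)?(?![A-Za-z])"
)


def extract_choice(text: str) -> Optional[str]:
    """Return the last choice letter announced in the text, or None if absent."""
    matches = _ANSWER_RE.findall(text)
    return matches[-1] if matches else None


print(extract_choice("2x = 8, so x = 4. Answer: (B)"))            # B
print(extract_choice("answer: A. Wait, rechecking. Final answer: D"))  # D
print(extract_choice("The answer is Canada"))                      # None
```

Taking the last match rather than the first matters for CoT outputs, where a model may name a tentative answer mid-reasoning and then revise it.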

Novel Architectural Elements

Comparative evaluation framework isolating 'Planning' vs 'Execution' phases in symbolic reasoning to analyze CoT utility
Categorization taxonomy splitting reasoning tasks into Symbolic, Mathematical, Logical vs. Commonsense/Soft Reasoning
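
The planning versus execution split can be illustrated with a toy arithmetic case: the language model would produce only the plan (here, an expression string), and a deterministic solver executes it. This is a sketch under that assumption; `execute_plan` and the example plan are hypothetical, and the solvers used in the paper are not specified in this summary.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def execute_plan(expression):
    """Execution phase: evaluate an arithmetic plan without running arbitrary code."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"unsupported construct: {ast.dump(node)}")

    return _eval(ast.parse(expression, mode="eval"))


# Planning phase (hypothetical model output for a word problem such as
# "48 items sold in April and half as many in May; how many in total?").
plan = "(48 / 2) + 48"
print(execute_plan(plan))  # 72.0
```

Because the solver never makes arithmetic slips, any remaining errors in this setup come from the plan, which is what lets the two phases be analyzed separately.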

Modeling

Base Model: 14 models including Llama 3.1, Mistral, and others (evaluated in inference-only mode)

Compute: Not reported in the paper (Evaluation-only study)

Comparison to Prior Work

vs. Standard CoT usage: This work questions the *universality* of CoT, showing it is often redundant for non-symbolic tasks
vs. Tool-use: Shows that for tasks where CoT helps (symbolic), offloading execution to a solver is often even better than CoT
vs. Decomposed Prompting [not cited in paper]: Comparison focuses on single-turn CoT vs Direct, rather than multi-stage decomposition frameworks

Limitations

Analysis primarily focuses on multiple-choice and short-answer formats; long-form generation is less explored
Relies on existing benchmarks which may not perfectly isolate 'soft' vs 'symbolic' reasoning in all edge cases
Does not evaluate training-time CoT (e.g., models fine-tuned specifically to reason), focusing only on prompting
Proprietary datasets mentioned in literature meta-analysis could not be inspected for granular breakdown

Reproducibility

Code: https://github.com/Zayne-sprague/To-CoT-or-not-to-CoT

The code is publicly available (https://github.com/Zayne-sprague/To-CoT-or-not-to-CoT). Prompts and model outputs are uploaded to Hugging Face (https://huggingface.co/collections/TAUR-Lab/cot-analysis-project-66bbb9e5e0156e65059895f5). Answer extraction logic is tailored per model/dataset.

📊 Experiments & Results

Evaluation Setup

Zero-shot and Few-shot prompting comparison across diverse reasoning datasets

Benchmarks:

MMLU (General Knowledge & Reasoning)
GSM8K (Grade School Math)
MATH (Challenging Math Problems)
BigBench Hard (BBH) (Algorithmic & Logical Reasoning)
CommonsenseQA (CSQA) (Commonsense Reasoning)

Metrics:

Accuracy
CoT Delta (Accuracy_CoT - Accuracy_Direct)
Statistical methodology: Paired bootstrapping with Bonferroni correction (p-value threshold 0.00027)
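
The CoT delta and its significance test can be sketched as below. This shows one common variant of the paired bootstrap (resampling per-question differences and counting how often CoT fails to come out ahead); the paper's exact procedure is not given in this summary, so the function names, the one-sided formulation, and the resample count are assumptions. The 0.00027 threshold is taken from the text above.

```python
import random

P_THRESHOLD = 0.00027  # Bonferroni-corrected threshold quoted in the summary


def bonferroni_threshold(alpha, num_comparisons):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / num_comparisons


def cot_delta(direct_correct, cot_correct):
    """Accuracy_CoT - Accuracy_Direct, in percentage points."""
    n = len(direct_correct)
    return 100.0 * (sum(cot_correct) - sum(direct_correct)) / n


def paired_bootstrap_pvalue(direct_correct, cot_correct, n_resamples=10_000, seed=0):
    """Estimate a one-sided p-value for 'CoT is more accurate than direct'.

    Questions are resampled with replacement, keeping each question's two
    outcomes paired. The estimate is the fraction of resamples in which CoT
    does not beat direct answering. Resolving p-values near 0.00027 needs
    several thousand resamples.
    """
    if len(direct_correct) != len(cot_correct):
        raise ValueError("inputs must be paired per question")
    n = len(direct_correct)
    diffs = [int(c) - int(d) for d, c in zip(direct_correct, cot_correct)]
    rng = random.Random(seed)
    not_better = 0
    for _ in range(n_resamples):
        resampled_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_sum <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy, hypothetical outcomes: CoT fixes every question that direct misses.
direct = [False] * 50
cot = [True] * 50
p = paired_bootstrap_pvalue(direct, cot)
print(cot_delta(direct, cot), p, p < P_THRESHOLD)  # 100.0 0.0 True
```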

Key Results

Meta-analysis of 110 papers shows CoT benefits are heavily concentrated in math and symbolic domains.

Task Category	Metric	Direct Answering	CoT	Δ
Symbolic Reasoning Tasks	Average Accuracy	45.5	59.7	+14.2
Math Tasks	Average Accuracy	45.5	57.8	+12.3
Non-Symbolic Tasks (Other)	Average Accuracy	56.1	56.8	+0.7

Direct experimental results confirm the findings from the literature, with math and symbolic tasks showing the largest gains.

Benchmark	Metric	Direct Answering	CoT	Δ
GSM8K	Accuracy Gain (CoT Delta)	Not reported in the paper	Not reported in the paper	+66.9
MATH	Accuracy Gain (CoT Delta)	Not reported in the paper	Not reported in the paper	+41.6

On MMLU, 95% of the total CoT gain is attributed to questions containing an equals sign ('='); the remaining 5% comes from all other questions.

Experiment Figures

Distribution of CoT deltas (gains) across task categories from the literature meta-analysis

Experimental CoT improvements averaged across 14 models for specific datasets

Main Takeaways

CoT is not a universal reasoning enhancer; it functions primarily as a mechanism for symbolic execution and planning.
On 'soft reasoning' tasks (commonsense, reading comprehension), CoT performs nearly identically to direct answering, suggesting the intermediate 'thinking' steps add little.
Symbolic solvers (tool-use) outperform CoT for execution-heavy tasks, suggesting CoT is a weaker substitute for formal tools.
Efficiency implication: Direct answering should be the default for non-math tasks to save inference costs without performance loss.
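
The efficiency takeaway suggests routing questions to a prompting mode before inference. The sketch below is a deliberately crude heuristic, not a method from the paper: the '=' signal echoes the MMLU analysis, while the digit-operator pattern and the `choose_strategy` name are assumptions added for illustration.

```python
import re

# '=' anywhere, or two digits joined by an arithmetic operator (assumed signal).
_SYMBOLIC_RE = re.compile(r"=|\d\s*[+*/^]\s*\d")


def choose_strategy(question):
    """Return 'cot' for questions that look symbolic, otherwise 'direct'."""
    return "cot" if _SYMBOLIC_RE.search(question) else "direct"


print(choose_strategy("If 3x = 12, what is x?"))    # cot
print(choose_strategy("What is 12 * 7?"))           # cot
print(choose_strategy("Which organ pumps blood?"))  # direct
```

A surface-level check like this will miss math word problems written entirely in prose, so a deployed router would likely need a stronger signal than pattern matching.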

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) and prompting
Understanding of Chain-of-Thought (CoT) vs. Direct Answering
Knowledge of common reasoning benchmarks (GSM8K, MMLU)

Key Terms

CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer

Direct Answering: A prompting strategy where the model is instructed to output the final answer immediately without intermediate reasoning steps

MMLU: Massive Multitask Language Understanding—a broad benchmark covering STEM, humanities, and social sciences

GSM8K: Grade School Math 8K—a benchmark of grade-school level mathematics word problems

Symbolic Reasoning: Problems grounded in a formal system (e.g., math, logic, code) where a symbolic expression can be derived and solved

Soft Reasoning: Problems relying on commonsense or natural language inference where no formal logical system or strict ruleset exists to derive the answer

vLLM: A high-throughput and memory-efficient inference engine for LLMs

SFT: Supervised Fine-Tuning