Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

📝 Paper Summary

Chain-of-Thought (CoT) Analysis, Cognitive Modeling in AI, LLM Evaluation

Drawing on cognitive psychology, this paper shows that on tasks where verbal deliberation impairs human performance, such as implicit statistical learning and face recognition, Chain-of-Thought systematically degrades LLM performance.

Core Problem

Chain-of-Thought (CoT) is widely applied as a default performance booster, but it systematically reduces accuracy in certain settings, and researchers lack heuristics to predict when these failures will occur.

Why it matters:

Inference-time reasoning is becoming standard in frontier models (e.g., OpenAI o1), creating a risk of deploying models that perform worse than their predecessors on specific tasks
Exhaustively testing the vast space of potential tasks to find CoT failure modes is intractable without guiding heuristics
Current benchmarks prioritize symbolic reasoning where CoT excels, masking its detrimental effects on tasks requiring implicit or non-verbal processing

Concrete Example: In an artificial grammar learning task, the reasoning-heavy model o1-preview achieves only 58.64% accuracy because it tries to verbally derive rules, while GPT-4o achieves 94.95% accuracy zero-shot by relying on implicit pattern matching.

Key Novelty

The Human-Overthinking Heuristic

Proposes a heuristic based on cognitive psychology: if verbal deliberation ('overthinking') hurts human performance on a task, CoT will likely hurt LLM performance on that same task
Adapts six classic psychological experiments (e.g., verbal overshadowing, rule-following with exceptions) into large-scale benchmarks to validate this parallel

Architecture

Conceptual diagram of the six task archetypes derived from psychology literature used to evaluate the impact of CoT

Evaluation Highlights

36.3 percentage-point absolute accuracy drop for OpenAI o1-preview (58.64%) compared to GPT-4o zero-shot (94.95%) on the implicit statistical learning (artificial grammar) task
For GPT-4o, CoT increases the number of training passes needed to learn rules with exceptions by up to 331% (from ~3 to ~13 passes) relative to direct prompting
CoT reduces face recognition accuracy across all six Vision-Language Models tested, often dropping performance to near random chance

Breakthrough Assessment

8/10

Provides a novel, scientifically grounded heuristic for predicting CoT failures, offering a crucial counter-narrative to the 'reasoning always helps' trend. The empirical results on o1-preview are particularly striking.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of LLM/LMM performance on tasks categorized by the impact of verbal deliberation on human subjects

Inputs: Task instances from 6 archetypes (e.g., image pairs for face recognition, strings for grammar classification)

Outputs: Classification labels (accuracy) or number of iterations to convergence

Pipeline Flow

Task Input (Psychological Archetype)
Prompt Strategy (Zero-shot vs. CoT)
Model Inference
Evaluation Metric Calculation
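The four pipeline stages above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not the authors' code: the prompt templates, the `evaluate` and `compare` helpers, and the `toy_model` stub are all hypothetical, and the stub's answers are hard-coded only to exercise the harness (they are not evidence about any real model).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt templates; the paper's exact wording is not reproduced here.
PROMPTS: Dict[str, str] = {
    "zero_shot": "{task}\nAnswer with only the label.",
    "cot": "{task}\nLet's think step by step, then give the label.",
}

def evaluate(model: Callable[[str], str],
             items: List[Tuple[str, str]],
             strategy: str) -> float:
    """Accuracy of `model` on (task_text, gold_label) pairs under one prompt strategy."""
    template = PROMPTS[strategy]
    correct = 0
    for task_text, gold in items:
        prediction = model(template.format(task=task_text))
        correct += int(prediction.strip().lower() == gold.lower())
    return correct / len(items)

def compare(model: Callable[[str], str],
            items: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score both strategies and report the CoT-minus-zero-shot difference."""
    scores = {name: evaluate(model, items, name) for name in PROMPTS}
    scores["delta"] = scores["cot"] - scores["zero_shot"]
    return scores

def toy_model(prompt: str) -> str:
    """Stub standing in for an API call; behaviour is hard-coded for the demo."""
    if "step by step" in prompt:
        return "yes"  # always answers "yes" under the CoT template
    return "yes" if "XVJ" in prompt else "no"

items = [("Is XVJ grammatical?", "yes"), ("Is QQQ grammatical?", "no")]
result = compare(toy_model, items)
print(result)
```

In a real run, `toy_model` would be replaced by a call to the model under test, and answer extraction from a CoT response would need more care than the exact string match used here.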

System Modules

Task Adaptation

Scale up classic psychological tasks (often N=1 in original studies) to thousands of synthetic examples for robust LLM evaluation

Model or implementation: Various data generation scripts (e.g., Stable Image Ultra for faces, synthetic grammar generators)

Prompt Strategy

Control the reasoning mode of the model

Model or implementation: Prompt Templates

Model Inference

Generate predictions under specified conditions

Model or implementation: GPT-4o, Claude 3.5 Sonnet, o1-preview, etc.

Modeling

Base Model: Evaluation covers multiple models: OpenAI o1-preview, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3.1 70B Instruct, InternVL2

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard CoT: This paper evaluates *failure cases* of CoT rather than proposing a new method. It compares CoT against Zero-shot baselines to demonstrate performance degradation.

Limitations

The heuristic is not a perfect predictor; differences in memory (context window vs. working memory) and priors (logic training data) can decouple human and model performance
Distractor tasks used in human studies were removed for LMMs due to model weakness, potentially altering task difficulty
Cost limitations restricted the evaluation of o1-preview to a subset of the dataset

Reproducibility

Code: https://github.com/JiayiGeng/CoT_overthinking

The benchmark datasets and code are publicly released. Model weights for closed-source models (GPT-4o, Claude) are not available, but open-weight models (Llama 3.1) are also evaluated.

📊 Experiments & Results

Evaluation Setup

Comparative evaluation of zero-shot vs. Chain-of-Thought prompting across 6 task archetypes derived from cognitive psychology

Benchmarks:

Implicit Statistical Learning (Grammar) (Binary classification of strings based on artificial grammar) [New]
Verbal Overshadowing (Faces) (Face recognition from descriptions/visuals) [New]
Exceptions to Rules (Vehicles) (Multi-turn classification learning with feedback) [New]
Logical Inconsistency (Identifying logical contradictions) [New]

Metrics:

Accuracy
Number of passes to convergence (learning efficiency)
Statistical methodology: Results are described as statistically significant, but the specific tests used are not detailed in the paper
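The passes-to-convergence metric listed above can be sketched as follows. The definition used here is an assumption, since the summary does not spell it out: a run is treated as converged on the first pass in which every item is classified correctly, and a run that never converges is assigned the cap `max_passes` (15 is chosen to match the 15 iterations shown in the learning curves). The function name and the example runs are hypothetical.

```python
from typing import Sequence

def passes_to_convergence(pass_results: Sequence[Sequence[bool]],
                          max_passes: int = 15) -> int:
    """Return the 1-indexed pass on which all items were first correct.

    Assumption: runs that never reach a fully correct pass are scored
    as `max_passes`.
    """
    for index, results in enumerate(pass_results[:max_passes], start=1):
        if all(results):
            return index
    return max_passes

def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)

# Hypothetical per-pass correctness flags for two runs (not data from the paper).
run_a = [[True, False, True], [True, True, True]]   # converges on pass 2
run_b = [[False, True, False]] * 4                  # never converges
scores = [passes_to_convergence(run_a), passes_to_convergence(run_b)]
print(scores, mean(scores))
```

Averaging this count over many runs gives a quantity comparable to the "Average Passes to Learn" row in the results, where lower is better.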

Key Results

Benchmark	Metric	Zero-shot / direct baseline	CoT / reasoning	Δ (CoT − baseline)
CoT dramatically reduces performance on Implicit Statistical Learning tasks, where pattern matching outweighs explicit rule formulation.
Implicit Statistical Learning (Grammar)	Accuracy	94.95	58.64	-36.31
CoT impairs performance on Verbal Overshadowing tasks, mirroring human difficulty in verbalizing fine-grained visual details.
Verbal Overshadowing (Faces)	Accuracy	62.20	50.80	-11.40
CoT hinders learning when data contains exceptions to simple rules, leading to inefficient hypothesis testing.
Exceptions to Rules (Vehicles)	Average Passes to Learn	3.15	13.58	+10.43
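The Δ column, and the roughly 331% increase in training passes quoted in the highlights, follow from simple arithmetic on the table values. The short check below reproduces both; the variable and function names are illustrative, and the numbers are copied from the table above.

```python
# Values copied from the Key Results table.
grammar_zero_shot = 94.95   # GPT-4o zero-shot accuracy (%)
grammar_reasoning = 58.64   # o1-preview accuracy (%)
passes_direct = 3.15        # GPT-4o, direct prompting
passes_cot = 13.58          # GPT-4o, CoT prompting

def absolute_change(baseline: float, treatment: float) -> float:
    """Difference in the metric's own units (percentage points for accuracy)."""
    return treatment - baseline

def relative_change_pct(baseline: float, treatment: float) -> float:
    """Change expressed as a percentage of the baseline."""
    return 100.0 * (treatment - baseline) / baseline

grammar_drop_pp = absolute_change(grammar_zero_shot, grammar_reasoning)
passes_delta = absolute_change(passes_direct, passes_cot)
passes_increase_pct = relative_change_pct(passes_direct, passes_cot)

print(f"Grammar accuracy change: {grammar_drop_pp:.2f} percentage points")
print(f"Passes-to-learn change: +{passes_delta:.2f} passes ({passes_increase_pct:.0f}%)")
```

Note the distinction between the two measures: the accuracy drop is an absolute difference in percentage points, while the 331% figure is a relative increase over the direct-prompting baseline.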

Experiment Figures

Learning curves for the 'Exceptions to Rules' vehicle classification task, showing accuracy over 15 iterations

Main Takeaways

CoT consistently reduces performance on tasks involving implicit statistical learning, verbal overshadowing, and rules with exceptions, paralleling human cognitive failures
The Human-Overthinking Heuristic is predictive but not absolute; CoT helps where models have stronger priors (e.g., formal logic) or more memory (context windows) than humans
Model performance drops are robust across different model families (GPT, Claude, Llama) and modalities (text, vision)

📚 Prerequisite Knowledge

Prerequisites

Understanding of Chain-of-Thought (CoT) prompting
Basic familiarity with cognitive psychology concepts (verbal overshadowing, implicit learning)

Key Terms

CoT: Chain-of-Thought—a prompting technique encouraging models to generate intermediate reasoning steps

FSG: Finite State Grammar—a simple type of grammar used to generate strings (artificial 'words') following specific rule structures
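To make the FSG definition concrete, here is a minimal sketch of a finite state grammar that both generates strings and checks whether a string is grammatical. The transition table is a made-up toy, not the grammar used in the paper, and `generate` and `is_grammatical` are hypothetical helper names.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical grammar: state -> list of (emitted letter, next state).
# A next state of None means the string may end there. Not the paper's grammar.
GRAMMAR: Dict[int, List[Tuple[str, Optional[int]]]] = {
    0: [("T", 1), ("V", 2)],
    1: [("P", 1), ("T", 2)],
    2: [("X", 1), ("S", None)],
}

def generate(rng: random.Random, max_len: int = 12) -> str:
    """Random walk from state 0 until the exit; retry if the walk gets too long."""
    while True:
        state: Optional[int] = 0
        letters: List[str] = []
        while state is not None and len(letters) < max_len:
            letter, state = rng.choice(GRAMMAR[state])
            letters.append(letter)
        if state is None:
            return "".join(letters)

def is_grammatical(string: str) -> bool:
    """Follow the transitions letter by letter; accept only if we end at the exit."""
    state: Optional[int] = 0
    for letter in string:
        if state is None:
            return False  # letters remain after the grammar has already exited
        transitions = dict(GRAMMAR[state])
        if letter not in transitions:
            return False
        state = transitions[letter]
    return state is None

rng = random.Random(0)
samples = [generate(rng) for _ in range(50)]
print(samples[:5])
```

A classification benchmark of the kind described can then pair grammatical strings from `generate` with ungrammatical foils, for example strings over the same letters that `is_grammatical` rejects.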

Verbal Overshadowing: A psychological phenomenon where describing a non-verbal stimulus (like a face) impairs the ability to recognize it later

Implicit Statistical Learning: Learning patterns from data without explicit instruction or the ability to verbalize the rules (e.g., learning a grammar by seeing examples)

LMM: Large Multimodal Model—an AI model capable of processing both text and images (e.g., GPT-4o, Claude 3.5 Sonnet)

Inference-time reasoning: The process where a model spends computational resources generating reasoning tokens before producing a final answer (e.g., OpenAI o1)