Evaluation Setup
Comparative evaluation of zero-shot vs. chain-of-thought (CoT) prompting across six task archetypes derived from cognitive psychology.
Benchmarks:
- Implicit Statistical Learning (Grammar) (Binary classification of strings based on artificial grammar) [New]
- Verbal Overshadowing (Faces) (Face recognition from descriptions/visuals) [New]
- Exceptions to Rules (Vehicles) (Multi-turn classification learning with feedback) [New]
- Logical Inconsistency (Identifying logical contradictions) [New]
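The first benchmark follows the classic artificial-grammar-learning paradigm: strings are generated by a small finite-state grammar, and the model must classify unseen strings as grammatical or not. A minimal sketch of such a setup; the transition table, function names, and corruption scheme here are illustrative assumptions, not the paper's actual grammar:

```python
import random

# Assumed transition table for a small finite-state (Reber-style)
# artificial grammar: state -> list of (symbol, next_state).
# None as next_state marks a legal exit from the grammar.
GRAMMAR = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 3)],
    3: [("X", 2), ("S", None), ("V", None)],
}

def grammatical_string(rng: random.Random, max_len: int = 12) -> str:
    """Walk the grammar from state 0, emitting symbols until an exit edge
    (max_len is a safety cap against long self-loop runs)."""
    state, out = 0, []
    while state is not None and len(out) < max_len:
        symbol, state = rng.choice(GRAMMAR[state])
        out.append(symbol)
    return "".join(out)

def ungrammatical_string(rng: random.Random) -> str:
    """Corrupt a grammatical string by substituting one random symbol.
    (A sketch: a substitution can occasionally still be grammatical.)"""
    s = list(grammatical_string(rng))
    i = rng.randrange(len(s))
    alphabet = sorted({sym for edges in GRAMMAR.values() for sym, _ in edges})
    s[i] = rng.choice([a for a in alphabet if a != s[i]])
    return "".join(s)

rng = random.Random(0)
# Binary classification data: label 1 = grammatical, 0 = violation.
dataset = [(grammatical_string(rng), 1) for _ in range(50)] + \
          [(ungrammatical_string(rng), 0) for _ in range(50)]
```

Humans absorb such grammars implicitly; the paper's finding is that forcing explicit verbal reasoning (CoT) over these strings hurts classification.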
Metrics:
- Accuracy
- Number of passes to convergence (learning efficiency)
- Statistical methodology: results are reported as statistically significant, but the specific tests used are not explicitly detailed in the paper
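The two metrics above can be computed from per-trial logs. A minimal sketch; the convergence convention (first pass with all items correct, with a sentinel when the criterion is never met) is an assumption, since the paper's exact definition is not reported here:

```python
from typing import Sequence

def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of trials where the model's answer matches the gold label."""
    assert len(predictions) == len(labels)
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def passes_to_convergence(per_pass_all_correct: Sequence[bool]) -> int:
    """Number of passes over the training items until the first pass in
    which every item is classified correctly (learning efficiency).
    Returns len(...) + 1 if the criterion is never met -- a hypothetical
    convention for this sketch."""
    for i, all_correct in enumerate(per_pass_all_correct, start=1):
        if all_correct:
            return i
    return len(per_pass_all_correct) + 1
```

For example, `passes_to_convergence([False, False, True, True])` returns 3: the model first gets every item right on the third pass.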
Key Results
| Benchmark | Metric | Baseline (zero-shot) | This Paper (CoT) | Δ |
|---|---|---|---|---|
| Implicit Statistical Learning (Grammar) | Accuracy | 94.95 | 58.64 | -36.31 |
| Verbal Overshadowing (Faces) | Accuracy | 62.20 | 50.80 | -11.40 |
| Exceptions to Rules (Vehicles) | Average Passes to Learn | 3.15 | 13.58 | +10.43 |

- Implicit Statistical Learning: CoT dramatically reduces performance on tasks where pattern matching outweighs explicit rule formulation.
- Verbal Overshadowing: CoT impairs performance, mirroring human difficulty in verbalizing fine-grained visual details.
- Exceptions to Rules: CoT hinders learning when data contains exceptions to simple rules, leading to inefficient hypothesis testing.
Main Takeaways
- CoT consistently reduces performance on tasks involving implicit statistical learning, verbal overshadowing, and rules with exceptions, paralleling human cognitive failures
- The 'Human Overthinking' heuristic is predictive but not absolute; CoT still helps where models have superior priors (e.g., formal logic) or memory (long context windows) compared to humans
- Model performance drops are robust across different model families (GPT, Claude, Llama) and modalities (text, vision)