Large Language Model Reasoning Failures

📝 Paper Summary

LLM Reliability and Robustness | Cognitive Evaluation of LLMs

This survey unifies fragmented research on Large Language Model (LLM) limitations by proposing a taxonomy that categorizes reasoning failures into fundamental, application-specific, and robustness issues across embodied and non-embodied domains.

Core Problem

Despite impressive performance, LLMs exhibit significant, systematic reasoning failures—ranging from simple logical slips to complex social misunderstandings—that remain fragmented across the literature without a unified framework.

Why it matters:

Current research on failures is scattered and case-by-case, preventing the identification of common root causes like architectural deficits or training data biases
LLMs are increasingly deployed in high-stakes domains, yet their reasoning is often brittle, failing under minor prompt variations (robustness issues) or in specific application contexts
Understanding these failures through a cognitive science lens (e.g., lack of executive function) is necessary to guide the development of more reliable AI systems

Concrete Example: In Theory of Mind (ToM) tasks, such as the 'false belief' test, advanced models like GPT-4 may solve the standard version but fail decisively when the prompt phrasing is slightly modified, revealing that they lack a robust, human-like understanding of others' mental states.

Key Novelty

Unified Reasoning Failure Taxonomy

Classifies reasoning into two main categories: Embodied (requiring physical interaction) and Non-embodied (subdivided into Informal/Intuitive and Formal/Logical)
Categorizes failures into three types: Fundamental (intrinsic to architecture), Application-specific (domain limitations), and Robustness (instability under minor variations)
Synthesizes root causes by mapping LLM failures to human cognitive phenomena, such as deficits in working memory, inhibitory control, and susceptibility to cognitive biases
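
The two-way classification above (reasoning type crossed with failure type) can be sketched as a small data structure. This is only an illustrative encoding, not something the survey provides: the class names, the `group_by_cell` helper, and the cells assigned to the two sample cases are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class ReasoningType(Enum):
    EMBODIED = "embodied"
    INFORMAL = "non-embodied/informal"
    FORMAL = "non-embodied/formal"


class FailureType(Enum):
    FUNDAMENTAL = "fundamental"
    APPLICATION_SPECIFIC = "application-specific"
    ROBUSTNESS = "robustness"


@dataclass(frozen=True)
class FailureCase:
    name: str
    reasoning: ReasoningType
    failure: FailureType


def group_by_cell(cases):
    """Group case names by their (reasoning type, failure type) cell."""
    grid = {}
    for case in cases:
        grid.setdefault((case.reasoning, case.failure), []).append(case.name)
    return grid


# Hypothetical placements, chosen only to illustrate the grid.
cases = [
    FailureCase("false-belief task fails after rephrasing",
                ReasoningType.INFORMAL, FailureType.ROBUSTNESS),
    FailureCase("proactive interference in working memory",
                ReasoningType.INFORMAL, FailureType.FUNDAMENTAL),
]
grid = group_by_cell(cases)
for (reasoning, failure), names in grid.items():
    print(f"{reasoning.value} / {failure.value}: {names}")
```

Keeping the two dimensions as separate enums means a failure case is always tagged on both, which mirrors how the taxonomy treats reasoning type and failure type as independent.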

Architecture

A hierarchical taxonomy chart organizing LLM reasoning failures

Breakthrough Assessment

8/10

Comprehensive survey that structures a critical but fragmented field. While it doesn't propose a new model, its taxonomy provides a necessary foundation for future robustness research.

⚙️ Technical Details

Problem Definition

Setting: Survey and taxonomy construction for analyzing deviations in model responses from logical coherence, contextual relevance, or factual correctness

Inputs: Existing literature and failure cases of LLMs

Outputs: Structured taxonomy and analysis of reasoning failures

Limitations

The survey relies on behavioral assessment of black-box models, as internal interpretability remains limited
Rapid evolution of LLM capabilities means specific failure examples may become outdated quickly (e.g., o1-mini improving on ToM)
Does not propose a new automated benchmark for systematic evaluation, but rather organizes existing findings

Reproducibility

Code: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures

The paper releases a GitHub repository containing the collected literature and failure cases. As a survey, it has no model weights or training code to reproduce.

📊 Experiments & Results

Evaluation Setup

Qualitative synthesis of existing literature across cognitive, social, and logical reasoning domains

Benchmarks:

Theory of Mind (ToM) tasks: social reasoning (false belief, perspective taking)
Cognitive bias tests: informal reasoning (content effect, anchoring bias)

Metrics:

Robustness (consistency across prompt variations)
Logical coherence
Factual correctness

Statistical methodology: Not explicitly reported in the paper
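
Since the summary does not give a formula for the robustness metric, the following is a minimal sketch of one plausible reading of "consistency across prompt variations": the share of paraphrased prompts that receive the majority answer, alongside a stricter all-or-nothing score. The function names and the sample answers are hypothetical, and exact string matching after lowercasing is a simplifying assumption.

```python
from collections import Counter


def consistency(answers):
    """Fraction of answers that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


def robust_accuracy(answers, gold):
    """1.0 only if every paraphrase is answered correctly, else 0.0."""
    gold = gold.strip().lower()
    return float(all(a.strip().lower() == gold for a in answers))


# Hypothetical answers to four paraphrases of one false-belief question.
answers = ["basket", "Basket", "box", "basket"]
print(consistency(answers))                 # 0.75
print(robust_accuracy(answers, "basket"))   # 0.0
```

The gap between the two scores is the point: a model can look mostly consistent while still failing the stricter requirement that no rephrasing changes the answer.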

Main Takeaways

LLMs exhibit fundamental deficits in core executive functions: they suffer from limited working memory (high proactive interference), poor inhibitory control (inability to suppress default responses), and a lack of cognitive flexibility
Cognitive biases are pervasive: LLMs show content effects (reasoning accuracy drops when problems are abstract or conflict with prior beliefs), confirmation bias, order bias (sensitivity to the sequence in which information is presented), and framing effects, often inherited from training data or amplified by RLHF
Social reasoning is brittle: While models like GPT-4 can pass standard Theory of Mind (ToM) tests, they fail under slight perturbations, indicating they lack a robust, internalized representation of others' mental states
Moral and ethical reasoning is inconsistent: LLMs fail to consistently apply social norms or moral values, often prioritizing task completion over ethical coherence or fluctuating based on prompt phrasing
Multi-Agent Systems (MAS) amplify individual failures: Individual reasoning deficits propagate in MAS, leading to failures in long-horizon planning and coordination due to poor theory of mind between agents
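
The working-memory takeaway above can be made concrete with a small probe for proactive interference: reassign one key many times, then ask for its latest value, so that earlier values compete with the correct one. This is a sketch under stated assumptions, not the benchmark used in the surveyed literature. The prompt wording, `build_interference_prompt`, and `score` are all hypothetical, and the call to an actual model is deliberately left out.

```python
import random


def build_interference_prompt(key, n_updates, seed=0):
    """Return (prompt, expected) where `key` is reassigned n_updates times."""
    if n_updates < 1:
        raise ValueError("n_updates must be at least 1")
    rng = random.Random(seed)
    # Distinct three-digit values, so only the last assignment is correct.
    values = [str(v) for v in rng.sample(range(100, 1000), n_updates)]
    lines = [f"{key} = {v}" for v in values]
    question = f"What is the current value of {key}? Answer with the number only."
    return "\n".join(lines + [question]), values[-1]


def score(model_answer, expected):
    """Exact-match scoring after trimming whitespace."""
    return model_answer.strip() == expected


prompt, expected = build_interference_prompt("x", n_updates=5)
print(prompt)
# A model's reply would be scored with: score(reply, expected)
```

Sweeping `n_updates` upward and tracking accuracy would show how quickly stale values start to win, which is one way to quantify the interference described above.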

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) training paradigms (pre-training, RLHF)
Basic concepts from Cognitive Science (Executive functions, Theory of Mind)
Familiarity with reasoning benchmarks (e.g., mathematical reasoning, common sense QA)

Key Terms

Embodied Reasoning: Cognitive processes that depend on physical interaction with environments, relying on spatial intelligence and real-time feedback

ToM: Theory of Mind—the cognitive ability to attribute mental states (beliefs, intents, emotions) to oneself and others

RLHF: Reinforcement Learning from Human Feedback—a method to align LLM outputs with human preferences

CoT: Chain-of-Thought—a prompting technique encouraging models to generate intermediate reasoning steps

Proactive Interference: A memory failure where earlier information significantly disrupts the retrieval or processing of newer information

Executive Functions: Core cognitive processes including working memory, inhibitory control, and cognitive flexibility necessary for reasoning

MAS: Multi-Agent Systems—systems where multiple AI agents interact and collaborate to solve tasks

Hallucination: Generations that are factually incorrect or nonsensical, often stemming from the model's reliance on statistical patterns rather than grounded truth