
Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?

Chenrui Fan, Ming Li, Lichao Sun, Tianyi Zhou
University of Maryland, Lehigh University
arXiv.org (2025)
Tags: Reasoning, Benchmark, RL, Factuality

📝 Paper Summary

Topics: Reasoning Models, Safety and Robustness, Model Evaluation
Reasoning models fail on ill-posed questions by generating redundant, circular reasoning chains (2-4x longer than necessary) instead of abstaining, unlike non-reasoning models, which identify missing premises efficiently.
Core Problem
State-of-the-art reasoning models lack the critical thinking needed to identify ill-posed questions with Missing Premises (MiP), leading to meaningless, excessive computation.
Why it matters:
  • Violates the 'test-time scaling law': increased test-time compute fails to yield correct judgments on solvability
  • Indicates a risk of 'abuse of thinking patterns' where models blindly apply complex reasoning to trivial or impossible tasks
  • Wastes significant computational resources on redundant, circular self-doubt without producing valid answers
Concrete Example: When asked 'What is the value of a?' (unsolvable), DeepSeek-R1 generates thousands of tokens and spends minutes thinking, eventually outputting a hallucinated '2', whereas a human or standard LLM would immediately ask for clarification.
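The example above can be reproduced by deleting a necessary premise from a well-posed question, which is how MiP variants of existing problems are typically constructed. A minimal sketch (the helper name `make_mip_variant` is hypothetical, not from the paper):

```python
# Sketch: build a Missing-Premise (MiP) variant of a well-posed question
# by dropping one premise the answer depends on. Names are illustrative.

def make_mip_variant(premises, question, drop_index):
    """Remove one necessary premise so the question becomes unsolvable."""
    kept = [p for i, p in enumerate(premises) if i != drop_index]
    return " ".join(kept + [question])

premises = ["Let a = b + 1.", "Let b = 3."]
question = "What is the value of a?"

full_prompt = " ".join(premises + [question])           # solvable: a = 4
mip_prompt = make_mip_variant(premises, question, 1)    # 'b' undefined -> unsolvable
```

The ideal response to `mip_prompt` is to abstain or ask for the missing value of `b`; the paper's observation is that reasoning models instead loop and eventually hallucinate an answer.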
Key Novelty
Identification of MiP-Overthinking
  • Distinguishes 'MiP-Overthinking' (failure to identify unsolvable queries) from general overthinking (excessive steps for simple queries)
  • Reveals that reasoning models (trained with RL) are more susceptible to this failure mode than non-reasoning models, contradicting the expectation that more reasoning leads to better judgment
Evaluation Highlights
  • Reasoning models generate 2x to 4x more tokens on MiP questions compared to well-defined ones, often exceeding 3,000 tokens for simple unsolvable math queries
  • Step-level similarity in reasoning chains increases from 0.45 (well-defined) to 0.50 (MiP), indicating high redundancy and looping behavior
  • Non-reasoning models consistently outperform reasoning models in efficiency, using ~200 tokens to correctly identify unsolvable questions where reasoning models use >1,000
Breakthrough Assessment
7/10
Significant diagnostic paper revealing a critical flaw in how current reasoning models are trained (RL/SFT) with respect to negative constraints and critical thinking, though it proposes no mitigation.