Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?

📝 Paper Summary

Reasoning Models, Safety and Robustness, Model Evaluation

Reasoning models fail on ill-posed questions by generating redundant, circular reasoning chains (2-4x longer than necessary) instead of abstaining, unlike non-reasoning models which identify missing premises efficiently.

Core Problem

State-of-the-art reasoning models lack the critical thinking skills to identify ill-posed questions with Missing Premises (MiP), leading to meaningless and excessive computational consumption.

Why it matters:

Violates the 'test-time scaling law': increased test-time compute fails to yield correct judgments on solvability
Indicates a risk of 'abuse of thinking patterns' where models blindly apply complex reasoning to trivial or impossible tasks
Wastes significant computational resources on redundant, circular self-doubt without producing valid answers

Concrete Example: When asked 'What is the value of a?' (unsolvable), DeepSeek-R1 generates thousands of tokens and spends minutes thinking, eventually outputting a hallucinated '2', whereas a human or standard LLM would immediately ask for clarification.

Key Novelty

Identification of MiP-Overthinking

Distinguishes 'MiP-Overthinking' (failure to identify unsolvable queries) from general overthinking (excessive steps for simple queries)
Reveals that reasoning models (trained with RL) are more susceptible to this failure mode than non-reasoning models, contradicting the expectation that more reasoning leads to better judgment

Evaluation Highlights

Reasoning models generate 2x to 4x more tokens on MiP questions compared to well-defined ones, often exceeding 3,000 tokens for simple unsolvable math queries
Step-level similarity in reasoning chains increases from 0.45 (well-defined) to 0.50 (MiP), indicating high redundancy and looping behavior
Non-reasoning models consistently outperform reasoning models in efficiency, using ~200 tokens to correctly identify unsolvable questions where reasoning models use >1,000

Breakthrough Assessment

7/10

Significant diagnostic paper revealing a critical flaw in current reasoning model training (RL/SFT) regarding negative constraints and critical thinking, though it does not propose a solution.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of model behavior when facing ill-posed queries lacking necessary conditions

Inputs: Ill-posed question q_mip (Missing Premise)

Outputs: Reasoning chain c and final answer a (ideally an abstention)

Modeling

Base Model: None trained. The paper evaluates existing models, including DeepSeek-R1, QwQ-32B, and GPT-o1 (training details are not applicable because this is an evaluation paper)

Comparison to Prior Work

vs. General Overthinking: Focuses specifically on *unsolvable* inputs (MiP), identifying a distinct failure mode where reasoning length inversely correlates with quality
vs. Standard Benchmarking: Introduces specific 'negative constraints' (missing premises) to test critical thinking/abstention rather than just problem-solving accuracy

Limitations

Evaluation is limited to mathematical reasoning tasks; generalization to coding or commonsense reasoning is not fully explored
Primary analysis relies on existing models; no new training method is proposed to mitigate the issue
Analysis of proprietary models (like GPT-o1) is limited by lack of access to their internal probabilities or full reasoning traces

Reproducibility

The paper describes the construction of 4 datasets (MiP-Formula, MiP-SVAMP, MiP-GSM8K, MiP-MATH) in detail. No code URL is provided in the text. Reference answers used for calculation come from the standard benchmarks.

📊 Experiments & Results

Evaluation Setup

Comparative analysis of Reasoning vs. Non-Reasoning models on Well-defined vs. MiP variants of math problems

Benchmarks:

MiP-GSM8K (Grade school math with removed numerical conditions) [New]
MiP-SVAMP (Elementary math with swapped body/question pairs) [New]
MiP-MATH (Challenging math problems with removed premises) [New]
MiP-Formula (Synthetic formulas with unassigned variables) [New]
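
To make the idea of a "synthetic formula with an unassigned variable" concrete, here is a minimal sketch of how a well-defined question and its MiP variant could be paired. The function name `make_formula_mip`, the question template, and the example values are all hypothetical; the paper's actual construction procedure may differ.

```python
def make_formula_mip(variables, expression, values, drop):
    """Return a (well_defined, mip) question pair built from one formula.

    Hypothetical helper: the MiP variant simply omits the assignment for
    `drop`, so the expression can no longer be evaluated.
    """
    def render(assignments):
        given = ", ".join(
            f"{name} = {assignments[name]}"
            for name in variables
            if name in assignments
        )
        return f"Given {given}, what is the value of {expression}?"

    well_defined = render(values)
    # Remove one necessary premise to obtain the ill-posed variant.
    mip_values = {k: v for k, v in values.items() if k != drop}
    return well_defined, render(mip_values)


well_defined, mip = make_formula_mip(
    variables=["a", "b"],
    expression="a + 2 * b",
    values={"a": 3, "b": 5},
    drop="b",
)
print(well_defined)
print(mip)
```

The only correct response to the second question is to point out that `b` is never assigned, which is the behavior the benchmarks are designed to probe.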

Metrics:

Response Length (token count)
Abstain Rate (percentage)
Cosine Similarity of Reasoning Steps
Statistical methodology: Not explicitly reported in the paper
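
The first two metrics can be sketched as follows. This is an illustration under stated assumptions, not the paper's scoring code: whitespace splitting stands in for a real tokenizer, and the cue phrases in `ABSTAIN_CUES` are a hypothetical keyword heuristic for detecting abstention (the paper may rely on a different judge).

```python
# Hypothetical cue phrases; not taken from the paper.
ABSTAIN_CUES = (
    "cannot be determined",
    "not enough information",
    "insufficient information",
    "missing information",
)


def response_length(response: str) -> int:
    # Whitespace split as a crude stand-in for a tokenizer's token count.
    return len(response.split())


def is_abstention(response: str) -> bool:
    text = response.lower()
    return any(cue in text for cue in ABSTAIN_CUES)


def abstain_rate(responses) -> float:
    """Percentage of responses that explicitly decline to answer."""
    if not responses:
        return 0.0
    return 100.0 * sum(is_abstention(r) for r in responses) / len(responses)


responses = [
    "The answer cannot be determined because the value of b is never given.",
    "Wait, maybe a is 2. Alternatively it could be 3. The answer is 2.",
]
print(abstain_rate(responses))  # 50.0
print([response_length(r) for r in responses])
```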

Key Results

Comparison of response lengths shows explosive growth for reasoning models on MiP questions compared to well-defined ones, while non-reasoning models remain stable.
Benchmark	Metric	Well-defined	MiP	Δ
GSM8K	Response Length (DeepSeek-R1)	1000	3000	+2000
GSM8K	Response Length (Non-reasoning models)	200	200	0
MiP-GSM8K	Step-level Cosine Similarity	0.45	0.50	+0.05
MiP-GSM8K	Step-level Similarity Variance	0.0079	0.00082	-0.00708
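
The step-level similarity rows in the table can be read as the mean and variance of pairwise cosine similarity between reasoning steps within one response. A minimal sketch of that computation is below. It assumes steps are separated by blank lines and uses bag-of-words count vectors as a stand-in for whatever embedding the paper uses, so absolute values will not match the reported numbers.

```python
import math
import re
from collections import Counter
from itertools import combinations
from statistics import mean, pvariance


def split_steps(response):
    # Assumption: a blank line separates consecutive reasoning steps.
    return [s.strip() for s in re.split(r"\n\s*\n", response) if s.strip()]


def bow(step):
    # Bag-of-words vector; a stand-in for a learned sentence embedding.
    return Counter(re.findall(r"[a-z0-9]+", step.lower()))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def step_similarity_stats(response):
    """Mean and population variance of pairwise step similarity."""
    vectors = [bow(s) for s in split_steps(response)]
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    if not sims:
        return 0.0, 0.0
    return mean(sims), pvariance(sims)


response = (
    "Let me check the value of x.\n\n"
    "Let me check the value of x.\n\n"
    "The premise is missing."
)
avg, var = step_similarity_stats(response)
print(avg, var)
```

Under this reading, a higher mean together with a lower variance means the steps are uniformly similar to one another, which is consistent with the looping behavior described in the summary.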

Experiment Figures

Comparison of Response Length, Accuracy, and Abstain Rate across various LLMs for well-defined vs. MiP questions.

Heatmap of step-level cosine similarity within model responses on MiP-GSM8K.

Main Takeaways

Reasoning models (like DeepSeek-R1, QwQ) exhibit 'MiP-Overthinking', generating 2-4x more tokens for unsolvable questions than solvable ones, contradicting test-time scaling expectations.
The increased token count in reasoning models consists largely of 'self-doubt loops' (repeating checks, 'wait', 'alternatively') rather than productive critical thinking.
Non-reasoning models are surprisingly more robust to Missing Premise (MiP) questions, quickly abstaining with short responses.
Harder datasets (MiP-MATH) exacerbate the issue, causing even longer redundant reasoning chains and lower abstain rates across models.
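
One rough way to quantify the self-doubt loops mentioned in the takeaways is to count marker words in a reasoning trace. The sketch below uses only the two markers quoted above ('wait', 'alternatively'); the function name `count_markers` and the example trace are hypothetical, and a fuller analysis would need a broader marker list.

```python
import re

# Markers quoted in the summary; any extension of this list is an assumption.
SELF_DOUBT_MARKERS = ("wait", "alternatively")


def count_markers(response, markers=SELF_DOUBT_MARKERS):
    """Count whole-word, case-insensitive occurrences of each marker."""
    text = response.lower()
    return {
        m: len(re.findall(rf"\b{re.escape(m)}\b", text))
        for m in markers
    }


trace = (
    "Wait, a is not given. Alternatively, assume a = 2. "
    "Wait, that is a guess."
)
print(count_markers(trace))  # {'wait': 2, 'alternatively': 1}
```

Comparing these counts between a model's responses to well-defined and MiP versions of the same question gives a cheap signal of how much of the extra length is repeated second-guessing.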

📚 Prerequisite Knowledge

Prerequisites

Reasoning Models (e.g., DeepSeek-R1, GPT-o1)
Chain-of-Thought (CoT) prompting
Reinforcement Learning (RL) for reasoning

Key Terms

MiP: Missing Premise—a type of ill-posed question where a necessary condition to solve the problem is absent

MiP-Overthinking: The phenomenon where reasoning models generate excessively long, redundant reasoning paths for unsolvable MiP questions instead of abstaining

Test-time scaling law: The empirical observation that increasing the amount of compute (reasoning tokens) at inference time typically improves performance

Abstain Rate: The proportion of responses where the model explicitly declines to answer due to insufficient information

Self-doubt loop: A thinking pattern where the model repeatedly checks and questions its own reasoning without making progress, common in MiP-Overthinking

SFT: Supervised Fine-Tuning—training a model on a dataset of expert demonstrations

DeepSeek-R1: An open-source large language model optimized for reasoning tasks using reinforcement learning