Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency

📝 Paper Summary

Efficient Reasoning, Chain-of-Thought (CoT) Optimization

NoWait improves reasoning efficiency by suppressing specific self-reflection tokens (like 'Wait' or 'Hmm') during inference, reducing redundant steps without compromising model accuracy.

Core Problem

Large Reasoning Models (LRMs) suffer from 'overthinking,' generating excessively verbose and redundant Chain-of-Thought trajectories characterized by self-reflection tokens like 'Wait' and 'Hmm,' which increases latency and compute costs.

Why it matters:

Excessively long reasoning chains (thousands of tokens) create significant computational overhead and latency, hindering deployment in resource-constrained applications
Current efficiency methods often require expensive retraining (RL with length penalties) or compromise model utility when simply truncating thoughts

Concrete Example: In a math problem, a model might solve the equation, then generate 'Wait, let me double check...', entering a redundant validation loop that restates the same logic for hundreds of tokens before outputting the same answer.

Key Novelty

NoWait (Inference-time Keyword Suppression)

Identifies a set of 'reflection keywords' (e.g., 'Wait', 'Alternatively') that signal the start of redundant self-verification loops
Modifies the decoding process to set the logits of these specific tokens to negative infinity, so that their probability after softmax is zero, preventing the model from initiating unnecessary self-reflection paths
Forces the model to continue forward reasoning or conclude, pruning the 'Aha Moment' redundancy without altering model weights
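
The effect of the suppression on the next-token distribution can be written compactly. This is a sketch in notation introduced here for illustration, not taken from the paper: $z_v$ is the logit of token $v$, $V$ the vocabulary, $S \subsetneq V$ the set of suppressed token IDs, and $p_v$ the unmodified softmax probability.

```latex
\tilde{z}_v =
\begin{cases}
-\infty & \text{if } v \in S,\\
z_v & \text{otherwise,}
\end{cases}
\qquad
\tilde{p}_v
= \frac{\exp(\tilde{z}_v)}{\sum_{u \in V} \exp(\tilde{z}_u)}
= \begin{cases}
0 & \text{if } v \in S,\\[4pt]
\dfrac{p_v}{1 - \sum_{u \in S} p_u} & \text{otherwise.}
\end{cases}
```

Under these assumptions, masking is equivalent to deleting the suppressed tokens and renormalizing what remains, so the relative preferences among all other tokens are unchanged.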

Architecture

Illustration of the inference-time intervention where reflection keywords are suppressed.

Evaluation Highlights

Reduces Chain-of-Thought trajectory length by 27%–51% across five R1-style model series (textual, visual, and video)
Maintains or improves accuracy: QwQ-32B achieves +4.25% accuracy on AMC 2023 while reducing token count by ~30%
Outperforms training-free baselines: On visual reasoning (MMMU, MathVista), reduces tokens by ~49% with only a 3.42% average accuracy drop, whereas prompt-based 'NoThink' causes severe degradation

Breakthrough Assessment

7/10

Simple, effective, plug-and-play solution to a timely problem (overthinking in R1-style models). While the technical novelty is low (logit masking), the empirical finding that 'thinking' tokens are largely redundant is significant.

⚙️ Technical Details

Problem Definition

Setting: Efficient inference for Large Reasoning Models (LRMs) generating Chain-of-Thought (CoT)

Inputs: Reasoning problem prompt P

Outputs: Reasoning chain C and final answer A

Pipeline Flow

Keyword Identification (Pre-computation)
Token Expansion (Pre-computation)
Inference with Logit Suppression

System Modules

Keyword Identifier (Preprocessing)

Determine which words signal redundant self-reflection

Model or implementation: Empirical analysis on QwQ-32B
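
One plausible reading of this step, based on the "most frequent words after delimiters" description in the Reproducibility notes, is sketched below. The delimiter (a blank line), the function name `top_opening_words`, and the toy traces are assumptions made for illustration; the resulting candidates would presumably still need filtering to keep only reflection-like words.

```python
import re
from collections import Counter

def top_opening_words(traces, delimiter="\n\n", k=15):
    """Return the k most frequent words that open a chunk following a delimiter."""
    counts = Counter()
    for trace in traces:
        # Skip the first chunk: it starts the trace and does not follow a delimiter.
        for chunk in trace.split(delimiter)[1:]:
            match = re.match(r"\s*([A-Za-z']+)", chunk)
            if match:
                counts[match.group(1).lower()] += 1
    return [word for word, _ in counts.most_common(k)]

# Hypothetical reasoning traces, not model output from the paper.
traces = [
    "First, expand the product.\n\nWait, the sign is wrong.\n\nSo x = 3.",
    "Let x be the root.\n\nWait, check x = 2.\n\nHmm, that fails.\n\nWait, retry.",
]
print(top_opening_words(traces, k=1))  # ['wait']
```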

Token Expander (Preprocessing)

Map text keywords to specific tokenizer IDs

Model or implementation: Tokenizer lookup
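
A keyword such as 'wait' usually corresponds to several vocabulary entries (capitalized, lowercase, with or without a leading-space marker). The sketch below shows one way the lookup could work over a token-to-ID mapping, such as the dictionary returned by a Hugging Face tokenizer's `get_vocab()`. The matching rule, the function name `expand_keywords`, and the toy vocabulary are assumptions, not the paper's procedure.

```python
def expand_keywords(keywords, vocab):
    """Return sorted IDs of vocabulary tokens that are surface variants of a keyword."""
    targets = {keyword.lower() for keyword in keywords}
    banned = set()
    for token_text, token_id in vocab.items():
        # Drop common word-boundary markers ("Ġ" in byte-level BPE,
        # "▁" in SentencePiece), surrounding whitespace, and case.
        normalized = token_text.replace("Ġ", "").replace("▁", "").strip().lower()
        if normalized in targets:
            banned.add(token_id)
    return sorted(banned)

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
toy_vocab = {"Wait": 0, "Ġwait": 1, "ĠWait": 2, "waiter": 3, "Hmm": 4, "▁hmm": 5, "the": 6}
print(expand_keywords(["wait", "hmm"], toy_vocab))  # [0, 1, 2, 4, 5]
```

Exact matching after normalization keeps unrelated words such as "waiter" out of the ban list; a looser substring rule would catch more variants at the risk of banning legitimate tokens.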

Logit Processor

Prevent generation of reflection tokens during decoding

Model or implementation: N/A (Heuristic constraint)
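
To illustrate how a hard constraint of this kind changes a decoding trajectory, here is a self-contained toy in plain Python. The `toy_model` function is a hypothetical stand-in for a real model that falls into a 'Wait' loop; the four-token vocabulary and all logit values are invented for the example. In practice, Hugging Face `transformers` exposes a comparable ban list through the `bad_words_ids` argument of `generate`; whether the authors used it is not stated in this summary.

```python
import math

def suppress(logits, banned_ids):
    """Return a copy of `logits` with banned token IDs set to negative infinity."""
    masked = list(logits)
    for token_id in banned_ids:
        masked[token_id] = -math.inf
    return masked

def greedy_decode(next_logits, banned_ids, eos_id, max_steps=10):
    """Greedy decoding that applies the suppression before every argmax."""
    output = []
    for _ in range(max_steps):
        logits = suppress(next_logits(output), banned_ids)
        token_id = max(range(len(logits)), key=logits.__getitem__)
        output.append(token_id)
        if token_id == eos_id:
            break
    return output

# Toy vocabulary (assumed): 0 = "So", 1 = "Wait", 2 = "x=3", 3 = "<eos>"
def toy_model(prefix):
    if not prefix:
        return [2.0, 0.5, 1.0, 0.0]   # start with "So"
    if prefix[-1] == 0:
        return [0.0, 0.5, 2.0, 0.0]   # "So" is followed by the result "x=3"
    if prefix[-1] == 2:
        return [0.0, 3.0, 0.1, 2.0]   # after a result, "Wait" beats "<eos>"
    return [2.0, 0.0, 0.0, 0.0]       # after "Wait", restart with "So"

print(greedy_decode(toy_model, banned_ids=set(), eos_id=3))  # loops until max_steps
print(greedy_decode(toy_model, banned_ids={1}, eos_id=3))    # [0, 2, 3]
```

With token 1 banned, the second-best option after the result is the end-of-sequence token, so the toy trajectory stops after three tokens instead of cycling.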

Novel Architectural Elements

Intervention at the logit level specifically targeting 'anthropomorphic' reflection tokens to prune reasoning branches without stopping the reasoning process entirely

Modeling

Base Model: Evaluated on QwQ-32B, Phi4-Reasoning-Plus, Qwen3-32B, Kimi-VL-A3B-Thinking, QvQ-72B

Comparison to Prior Work

vs. NoThink: NoThink prompts model to skip thinking entirely, often causing accuracy collapse; NoWait keeps reasoning but prunes redundant branches
vs. Token-Budget: Token-Budget is a soft constraint via prompt often ignored by RL models; NoWait is a hard constraint on token generation
vs. O1-Pruner: O1-Pruner requires training and degrades performance on some models (QwQ); NoWait is training-free and preserves utility

Limitations

Relies on heuristic keyword lists which may vary by model or tokenizer
Does not improve efficiency for models that were not trained to produce long reasoning traces (e.g., standard instruct models); it applies only to RL-based reasoning models
Slight performance degradation observed in some visual reasoning tasks (-3.42% on Kimi-VL)

Reproducibility

The method is training-free and relies on logit manipulation. The process for generating the keyword list is described (the 15 most frequent words following delimiters). The specific keyword lists for each model are not given in the text, although the method to generate them is. No code URL is provided in the text.

📊 Experiments & Results

Evaluation Setup

Greedy decoding (or low-temperature sampling) on math and multimodal reasoning benchmarks

Benchmarks:

AMC 2023 (Math Reasoning)
AIME 2024 / AIME 2025 (Challenging Math Reasoning)
MMMU / MMMU-Pro (Visual Reasoning)

Metrics:

Accuracy (ACC)
Generation Length (LEN)
Statistical methodology: Not explicitly reported in the paper
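
The two metrics can be computed as in the sketch below. It assumes exact-match scoring on final answers and length measured as the number of generated tokens; the summary does not specify either choice, and the function names and sample data are illustrative.

```python
def accuracy(predictions, references):
    """ACC: percentage of final answers that exactly match the reference."""
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return 100.0 * correct / len(references)

def mean_length(token_id_sequences):
    """LEN: average number of generated tokens per response."""
    return sum(len(seq) for seq in token_id_sequences) / len(token_id_sequences)

# Hypothetical sample data for illustration only.
predictions = ["3", "7", "12", "5"]
references = ["3", "7", "10", "5"]
generations = [[0] * 120, [0] * 80]

print(accuracy(predictions, references))  # 75.0
print(mean_length(generations))           # 100.0
```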

Key Results

Textual reasoning results demonstrate that NoWait significantly reduces token usage while often improving accuracy on math benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
AMC 2023	Accuracy	Not reported in the paper	Not reported in the paper	+4.25
AMC 2023	Generation Length	100	70	-30
AMC 2023	Accuracy	Not reported in the paper	Not reported in the paper	+6.00
AIME 2025	Accuracy	Not reported in the paper	Not reported in the paper	+1.33

Multimodal results show substantial efficiency gains with minor accuracy trade-offs.
Benchmark	Metric	Baseline	This Paper	Δ
Image Reasoning Average (MMMU, MathVista, etc.)	Generation Length	2000	1020	-980
Image Reasoning Average	Accuracy	Not reported in the paper	Not reported in the paper	-3.42
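
The Generation Length rows above can be related to the percentage reductions quoted in the Evaluation Highlights with a one-line calculation. The helper name `relative_reduction` is introduced here for illustration; the inputs are the table's Baseline and This Paper values.

```python
def relative_reduction(baseline_len, new_len):
    """Percentage by which generation length shrinks relative to the baseline."""
    return 100.0 * (baseline_len - new_len) / baseline_len

print(relative_reduction(100, 70))     # 30.0, matching the ~30% figure for AMC 2023
print(relative_reduction(2000, 1020))  # 49.0, matching the ~49% figure for image reasoning
```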

Main Takeaways

Reasoning redundancy is prevalent: LRMs across text, image, and video modalities generate significant 'thought' padding that can be removed without harming outcomes.
Difficulty generalization: NoWait works across difficulty levels (AMC to AIME 2025), with even better performance retention on harder tasks.
Model generalization: Consistent efficiency gains across diverse architectures (Qwen, Phi, Kimi, QvQ) suggest the 'Wait' token phenomenon is a common artifact of current RL training strategies.
Reinforcement Learning inefficiency: Current RL algorithms incentivize models to adopt a low threshold for self-reflection, leading to unnecessary verification steps that NoWait effectively prunes.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) reasoning
Autoregressive decoding / Logits
Reinforcement Learning (RL) for reasoning models

Key Terms

LRM: Large Reasoning Model, an LLM specifically optimized (often via RL) to generate long, detailed reasoning chains before answering

CoT: Chain-of-Thought, a step-by-step reasoning trajectory generated by the model to decompose complex problems

Logit: The raw, unnormalized score output by the model for each token in the vocabulary before the softmax layer turns them into probabilities
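
For reference, the standard softmax that turns logits into probabilities, written with illustrative symbols ($z_v$ for the logit of token $v$, $V$ for the vocabulary):

```latex
p_v = \frac{\exp(z_v)}{\sum_{u \in V} \exp(z_u)}, \qquad \sum_{v \in V} p_v = 1 .
```

Because $\exp(-\infty) = 0$, a token whose logit is set to negative infinity receives probability exactly zero, which is why intervening on logits is enough to rule a token out.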

Thinking Chunk: A segment of reasoning within a CoT trajectory, often delimited by self-reflection keywords and associated with an intermediate result

Aha Moment: A phenomenon where a model rethinks or self-reflects on its trajectory, often signaled by anthropomorphic terms like 'Wait'

NoThink: A baseline strategy that attempts to remove reasoning entirely via prompt engineering (e.g., instructing the model 'Do not output thinking steps')