OverThink: Slowdown Attacks on Reasoning LLMs

📝 Paper Summary

Adversarial Attacks on LLMs, Inference Efficiency, Prompt Injection

OverThink injects benign computational puzzles into retrieved context to force Reasoning LLMs to generate excessive hidden reasoning tokens, inflating inference costs and latency without altering the final answer.

Core Problem

Reasoning LLMs (RLMs) generate costly 'hidden' reasoning tokens to solve problems, but adversaries can exploit this mechanism to artificially inflate computational costs and latency.

Why it matters:

Financial Impact: Output tokens (including hidden reasoning) cost money; inflating them increases operational costs for API providers or users with usage limits
Denial of Service: Excessive reasoning increases latency, potentially causing timeouts or delaying service for other users in resource-constrained environments
Stealth: Unlike jailbreaks that produce visible harmful content, slowdown attacks preserve the correct final answer, making them harder for users to detect

Concrete Example: A user asks an RLM-backed assistant to summarize a webpage. The webpage contains a hidden 'decoy' task (e.g., a complex Sudoku puzzle) injected by an adversary. The RLM detects the puzzle, spends thousands of tokens solving it in its hidden scratchpad (incurring high cost), and then correctly summarizes the page. The user sees the correct summary, but the request can consume up to 46x the expected number of reasoning tokens, and whoever is billed for output tokens pays for them.
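The cost impact in this example can be sketched with simple arithmetic. The price and token counts below are hypothetical placeholders chosen for illustration, not figures from the paper; only the 46x multiplier comes from the summary, where it is reported as an upper bound.

```python
def output_cost_usd(reasoning_tokens: int, answer_tokens: int,
                    usd_per_million_output_tokens: float) -> float:
    """Cost of one response when hidden reasoning tokens are billed as output."""
    billed = reasoning_tokens + answer_tokens
    return billed * usd_per_million_output_tokens / 1_000_000


# Hypothetical numbers, chosen only for illustration.
PRICE = 60.0            # USD per 1M output tokens (assumed)
BASE_REASONING = 500    # reasoning tokens for a clean request (assumed)
ANSWER = 200            # visible answer tokens, unchanged by the attack (assumed)
SLOWDOWN = 46           # upper-bound reasoning inflation from the summary

clean = output_cost_usd(BASE_REASONING, ANSWER, PRICE)
attacked = output_cost_usd(BASE_REASONING * SLOWDOWN, ANSWER, PRICE)
print(f"clean: ${clean:.4f}  attacked: ${attacked:.4f}  ratio: {attacked / clean:.1f}x")
```

Under these assumed numbers the bill grows by about 33x rather than 46x, because the visible answer tokens are unchanged and only the reasoning portion is inflated.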

Key Novelty

Stealthy Reasoning Slowdown Attack via Decoy Tasks

Identifies a new attack surface: the 'scratchpad' or reasoning chain of RLMs, which is usually hidden from users but counted towards billing and compute limits
Uses 'decoy' problems (like logic puzzles) that are benign enough to bypass safety filters but computationally expensive enough to trigger massive reasoning chains
Optimizes decoys using 'ICL-Evolve' to maximize reasoning effort while ensuring the final answer remains contextually accurate and stealthy

Architecture

Conceptual workflow of the OverThink attack, contrasting a normal RLM interaction with an attacked one.

Evaluation Highlights

Up to 46x increase in reasoning token count under large-scale context-agnostic attacks
Up to 7.8x increase in reasoning token count under context-aware attacks (where decoys are tailored to the text)
Attacks transfer across multiple state-of-the-art models (OpenAI reasoning models, open-source RLMs) and datasets (FreshQA, SQuAD, MuSR)

Breakthrough Assessment

8/10

Identifies a critical economic and systemic vulnerability in the emerging paradigm of inference-time scaling. The shift from attacking output safety to attacking inference cost is significant.

⚙️ Technical Details

Problem Definition

Setting: Adversarial attack on Reasoning LLMs (RLMs) consuming untrusted external context

Inputs: User query q and external context z (which may be adversarial z*)

Outputs: Reasoning tokens y_r (hidden) and final answer y_a (visible)
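One way to state the attacker's goal in this notation is as a constrained maximization. The function symbol f, the length operator |·|, and the answer-equivalence relation ≈ are notation assumed here for illustration; this summary does not give the paper's own formulation.

```latex
(y_r, y_a) = f(q, z), \qquad (y_r^*, y_a^*) = f(q, z^*)

\max_{z^*} \; |y_r^*| \quad \text{subject to} \quad y_a^* \approx y_a
```

Here z* is the adversarial context, and the constraint expresses the stealth requirement that the visible answer stays the same as in the clean case.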

Pipeline Flow

Decoy Selection (Choose base problem type)
Decoy Optimization (ICL-Evolve algorithm)
Instruction Crafting (Make task stealthy)
Injection (Insert into context z)
Inference (Target RLM processes q + z*)
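The five steps above can be sketched as a chain of small functions. Every name here is hypothetical, the decoy and instruction strings are placeholders, and appending the decoy at the end of the context is an assumption; the summary does not say where in z the decoy is placed.

```python
from dataclasses import dataclass


@dataclass
class Decoy:
    problem: str        # the decoy task text (e.g., a puzzle)
    instruction: str    # wording that asks the model to work through it


def select_decoy() -> Decoy:
    """Step 1: pick a base problem type (placeholder text, not from the paper)."""
    return Decoy(problem="<decoy puzzle text>", instruction="")


def optimize_decoy(decoy: Decoy) -> Decoy:
    """Step 2: stand-in for ICL-Evolve; returns the decoy unchanged here."""
    return decoy


def craft_instruction(decoy: Decoy) -> Decoy:
    """Step 3: attach wording intended to keep the final answer on-topic."""
    decoy.instruction = "<instruction wording>"
    return decoy


def inject(context: str, decoy: Decoy) -> str:
    """Step 4: build z* by appending the decoy to the clean context z (assumed placement)."""
    return f"{context}\n\n{decoy.instruction}\n{decoy.problem}"


def build_adversarial_context(context: str) -> str:
    decoy = craft_instruction(optimize_decoy(select_decoy()))
    return inject(context, decoy)


z = "Original retrieved passage."
z_star = build_adversarial_context(z)
# Step 5 (inference) would pass the user query q together with z_star to the target RLM.
```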

System Modules

Decoy Selector (Attack Generation)

Select a base problem type known to induce reasoning

Model or implementation: N/A (Selection logic)

Decoy Optimizer (ICL-Evolve) (Attack Generation)

Iteratively evolve the decoy task to maximize reasoning tokens required to solve it

Model or implementation: Optimization LLM (e.g., GPT-4 or similar proxy)
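A minimal sketch of what such an evolve loop could look like follows. It is not the authors' algorithm: the greedy top-k selection, the function names, and the toy scoring and proposal functions are all assumptions made so the example runs without model access. In a real setting, `score` would be the reasoning-token count reported for a decoy, and `propose` would call the optimization LLM with the scored examples in its prompt.

```python
from typing import Callable, List, Tuple

Scored = Tuple[float, str]


def icl_evolve_sketch(
    seeds: List[str],
    propose: Callable[[List[Scored]], str],
    score: Callable[[str], float],
    rounds: int = 5,
    keep: int = 3,
) -> Scored:
    """Greedy evolve loop: keep the top-`keep` decoys as in-context examples."""
    pool: List[Scored] = sorted(((score(s), s) for s in seeds), reverse=True)[:keep]
    for _ in range(rounds):
        candidate = propose(pool)             # optimizer sees the scored examples
        pool.append((score(candidate), candidate))
        pool = sorted(pool, reverse=True)[:keep]
    return pool[0]


# Toy stand-ins so the sketch runs without any model access (both hypothetical).
def toy_score(decoy: str) -> float:
    return float(len(decoy))                  # pretend longer decoys cost more reasoning


def toy_propose(examples: List[Scored]) -> str:
    return examples[0][1] + " plus one more constraint"


best_score, best_decoy = icl_evolve_sketch(["puzzle A", "puzzle BB"], toy_propose, toy_score)
```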

Injector

Embed the optimized decoy into the external source (text or image)

Model or implementation: Rule-based or LLM-based Rewriter

Novel Architectural Elements

The use of 'Decoy Tasks' specifically as a mechanism for latency/cost attacks rather than behavioral modification
ICL-Evolve optimization loop targeting reasoning token count as the objective function

Modeling

Base Model: Evaluated on OpenAI o1-preview, o1-mini, DeepSeek-R1-Distill-Llama-70B, and others

Comparison to Prior Work

vs. Sponge Examples: Targets generative RLMs and reasoning tokens specifically, rather than generic inference latency in encoders
vs. Indirect Prompt Injection: Goal is 'slowdown/cost' while keeping output correct, rather than hijacking the output content
vs. Nerd Sniping: Intentional adversarial injection rather than accidental triggering; optimization for maximum cost
vs. Bagdasarian et al. (Energy consumption attacks) [not cited in paper]: OverThink focuses on the specific mechanism of 'reasoning tokens' in CoT models rather than general architectural inefficiencies

Limitations

Relies on the RLM attending to the specific part of the context containing the decoy
Requires the ability to inject content into sources the RLM retrieves (e.g., via wiki edits or web hosting)
Effectiveness depends on the specific RLM's tendency to follow 'solve this first' instructions

Reproducibility

Code: https://github.com/akumar2709/OVERTHINK_public.git

The paper evaluates on public datasets (FreshQA, SQuAD, MuSR) and models.

📊 Experiments & Results

Evaluation Setup

RLMs answering user queries using retrieved external context (which contains adversarial injections)

Benchmarks:

FreshQA (Factual Reasoning / QA)
SQuAD (Reading Comprehension)
MuSR (Multi-step Soft Reasoning)

Metrics:

Reasoning Token Count (Primary)
Answer Accuracy / Stealthiness
Statistical methodology: Not explicitly reported in the paper
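The two metrics can be computed along the following lines. The exact definitions used in the paper are not given in this summary, so the ratio-of-means formula, the function names, and the sample measurements below are assumptions for illustration.

```python
from typing import Sequence


def reasoning_increase_factor(clean_tokens: Sequence[int],
                              attacked_tokens: Sequence[int]) -> float:
    """Ratio of mean reasoning tokens with the decoy to mean reasoning tokens without it."""
    if not clean_tokens or len(clean_tokens) != len(attacked_tokens):
        raise ValueError("need two non-empty, equally sized samples")
    mean_attacked = sum(attacked_tokens) / len(attacked_tokens)
    mean_clean = sum(clean_tokens) / len(clean_tokens)
    return mean_attacked / mean_clean


def answer_preservation_rate(clean_correct: Sequence[bool],
                             attacked_correct: Sequence[bool]) -> float:
    """Fraction of queries answered correctly without the decoy that stay correct with it."""
    kept = [a for c, a in zip(clean_correct, attacked_correct) if c]
    return sum(kept) / len(kept) if kept else 0.0


# Made-up measurements for three queries.
factor = reasoning_increase_factor([400, 600, 500], [4000, 6000, 5000])
stealth = answer_preservation_rate([True, True, False], [True, True, True])
```

A factor of 1.0 means no slowdown, and a preservation rate near 1.0 indicates the attack did not degrade answers that were correct to begin with.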

Key Results

Attack effectiveness: reasoning-token inflation relative to the unattacked baseline (1.0).
Benchmark	Metric	Baseline	This Paper	Δ
Average across datasets	Reasoning Token Increase Factor (context-agnostic attack)	1.0	46.0	+45.0
Average across datasets	Reasoning Token Increase Factor (context-aware attack)	1.0	7.8	+6.8

Main Takeaways

Significant inflation of reasoning tokens is possible without altering the final answer, demonstrating that the 'OverThink' concept works in practice.
Context-agnostic attacks (generic templates) cause higher slowdowns (up to 46x) but might be easier to detect than context-aware attacks (up to 7.8x).
Defenses like simple input filtering or paraphrasing the context are currently insufficient to stop the attack without degrading utility.
The attack transfers across different model families, suggesting a general vulnerability in current RLM instruction following.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Chain-of-Thought (CoT) prompting
Familiarity with Inference-Time Scaling in LLMs
Concept of Indirect Prompt Injection

Key Terms

RLM: Reasoning Language Model—a model (like o1 or DeepSeek-R1) that generates intermediate 'reasoning tokens' (scratchpad) before the final answer to improve performance

Reasoning Tokens: Intermediate output tokens generated by an RLM to 'think' through a problem; these are often hidden from the user but incur computational cost

Indirect Prompt Injection: An attack where the adversary hides instructions in external data (e.g., a website) that the LLM retrieves, rather than in the user's direct prompt

Decoy Task: A benign but computationally intensive problem (e.g., Sudoku, Markov Decision Process) injected to force the model to spend resources solving it

ICL-Evolve: The authors' proposed optimization algorithm that uses In-Context Learning to evolve decoy tasks into more computationally expensive variants

Context-Aware Attack: Integrating the decoy task smoothly into the existing text of the retrieved source so it blends in

Context-Agnostic Attack: Injecting a general decoy template into a source without tailoring it to the specific surrounding text

Scratchpad: The internal buffer where an RLM writes its reasoning steps before generating the final response