Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

📝 Paper Summary

Reasoning LLMs, Inference Efficiency, Test-time Compute

Selecting shorter reasoning chains from parallel generations consistently outperforms standard majority voting and longer chains while reducing computational cost and latency.

Core Problem

Current reasoning LLMs scale test-time compute by generating long thinking chains, which incurs high computational cost and latency without guaranteeing correctness.

Why it matters:

The assumption that longer thinking equals better reasoning drives massive increases in inference cost and energy consumption.
Slow decoding times due to long autoregressive chains hinder real-time applications of reasoning models.
Existing methods like majority voting waste resources on long, often incorrect trajectories.

Concrete Example: When a model generates 20 solutions for a math problem, the longest chain might backtrack excessively and hallucinate, while a much shorter chain reaches the correct answer directly. Standard majority voting weighs every chain equally, so it can side with the more numerous (and potentially long) incorrect chains, whereas the proposed method picks the short one.

Key Novelty

short-m@k Inference Strategy

Run k parallel generations but strictly halt all computation as soon as the first m trajectories finish (where m is small, e.g., 1 or 3).
Select the final answer by majority vote among only these m shortest completed chains (breaking ties in favor of the shortest chain).
Leverages the empirical finding that for a specific question, correct reasoning chains are typically shorter than incorrect ones.
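The selection rule above can be sketched offline over samples that have already been generated. This is a minimal illustration, not the authors' implementation: the `Chain` type, the function names, and the sample answers and token counts are hypothetical, and it assumes all chains decode at the same speed, so the first m to finish are the m shortest.

```python
from collections import Counter
from typing import NamedTuple


class Chain(NamedTuple):
    answer: str           # final answer extracted from the chain
    thinking_tokens: int  # length of the thinking part, in tokens


def short_m_at_k(chains: list[Chain], m: int) -> str:
    """Vote among the m shortest of k chains; ties go to the shortest chain."""
    if not 1 <= m <= len(chains):
        raise ValueError("m must be between 1 and k")
    # Assumption: equal decoding speed, so "first m finished" == "m shortest".
    finished = sorted(chains, key=lambda c: c.thinking_tokens)[:m]
    votes = Counter(c.answer for c in finished)
    top = max(votes.values())
    # `finished` is sorted by length, so the first chain whose answer holds
    # the top vote count is the shortest chain among the tied answers.
    return next(c.answer for c in finished if votes[c.answer] == top)


def majority_vote(chains: list[Chain]) -> str:
    """Standard majority voting over all k completed chains."""
    return Counter(c.answer for c in chains).most_common(1)[0][0]


# Hypothetical samples: two short chains agree, three long chains disagree.
samples = [
    Chain("204", 3100), Chain("204", 3600), Chain("117", 4200),
    Chain("117", 9800), Chain("117", 12500),
]
print(short_m_at_k(samples, m=1))  # 204
print(short_m_at_k(samples, m=3))  # 204
print(majority_vote(samples))      # 117
```

In this made-up case the long chains outvote the short ones under full majority voting, which is the failure mode the Concrete Example describes.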

Architecture

Visualization of the short-m@k inference process compared to standard approaches.

Evaluation Highlights

Choosing the shortest chain outperforms the longest chain by up to 34.5% accuracy on math benchmarks.
short-1@k matches majority voting performance while reducing compute by up to 40% on LN-Super-49B.
short-3@k consistently outperforms majority voting across all compute budgets while reducing wall time by up to 33%.

Breakthrough Assessment

8/10

Challenges the prevailing 'more compute/longer thinking is better' paradigm with strong empirical evidence. Offers a simple, practical inference method that improves both speed and accuracy.

⚙️ Technical Details

Problem Definition

Setting: Complex reasoning tasks (Math, Science) where models generate intermediate 'thinking' tokens before a final answer.

Inputs: Reasoning question q

Outputs: Final answer derived from a generated reasoning chain

Pipeline Flow

Parallel Generation (start k sequences)
Early Termination (stop when m sequences finish)
Selection/Voting (vote among m completed)

System Modules

Generator

Generate k reasoning chains in parallel

Model or implementation: Various Reasoning LLMs (e.g., Llama-3.3-Nemotron-Super-49B, QwQ-32B)

Monitor/Stopper

Monitor completion of chains and halt all computation once m chains are done

Model or implementation: Rule-based logic

Aggregator

Select final answer from the m completed chains

Model or implementation: Majority Vote (tie-break: shortest length)

Novel Architectural Elements

Latency-aware termination condition: Stopping batch generation based on the completion of the fastest subset (m) rather than waiting for all (k) or a fixed token limit.
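The termination condition can be illustrated with a small asyncio simulation. This is a sketch under stated assumptions, not the paper's serving code: `generate` is a hypothetical stand-in that sleeps in proportion to chain length instead of decoding tokens, and in a real deployment the cancellation step would correspond to aborting the unfinished requests in the serving engine.

```python
import asyncio


async def generate(chain_id: int, n_tokens: int) -> tuple[int, int]:
    """Hypothetical decoding stream: longer chains take longer to finish."""
    await asyncio.sleep(n_tokens * 2e-4)  # simulated per-token latency
    return chain_id, n_tokens


async def first_m_finished(lengths: list[int], m: int) -> list[tuple[int, int]]:
    """Start k streams, keep the first m to finish, cancel the rest."""
    tasks = [asyncio.create_task(generate(i, n)) for i, n in enumerate(lengths)]
    finished: list[tuple[int, int]] = []
    try:
        for next_done in asyncio.as_completed(tasks):
            finished.append(await next_done)
            if len(finished) == m:
                break  # latency-aware stop: do not wait for all k
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already completed
        await asyncio.gather(*tasks, return_exceptions=True)
    return finished


# k = 4 simulated chains; keep the m = 2 that finish first.
print(asyncio.run(first_m_finished([300, 100, 500, 200], m=2)))
```

The chains that come back are the ones the Aggregator would vote over; the cancelled streams never contribute tokens beyond the stopping point.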

Modeling

Base Model: Llama-3.3-Nemotron-Super-49B-v1 (LN-Super-49B), R1-Distill-Qwen-32B, QwQ-32B, R1-670B

Training Method: Supervised Fine-Tuning (SFT) on filtered datasets

Adaptation: Full fine-tuning

Trainable Parameters: All parameters (for the fine-tuning experiments on Qwen-2.5)

Training Data:

S1-short: Examples with shortest reasoning chains from S1 dataset
S1-long: Examples with longest reasoning chains from S1 dataset
S1-random: Randomly selected chains from S1 dataset
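One way to picture how the three training variants differ is the sketch below. It assumes several candidate reasoning chains are available per question, which the summary does not spell out; the function names and the whitespace-based `count_tokens` are hypothetical stand-ins (a real pipeline would use the model's tokenizer).

```python
import random


def count_tokens(chain: str) -> int:
    """Crude length proxy; a real pipeline would use the model tokenizer."""
    return len(chain.split())


def build_variants(
    candidates: dict[str, list[str]], seed: int = 0
) -> dict[str, dict[str, str]]:
    """Pick one chain per question for each of the three dataset variants."""
    rng = random.Random(seed)  # fixed seed keeps the random variant repeatable
    variants: dict[str, dict[str, str]] = {
        "S1-short": {}, "S1-long": {}, "S1-random": {},
    }
    for question, chains in candidates.items():
        variants["S1-short"][question] = min(chains, key=count_tokens)
        variants["S1-long"][question] = max(chains, key=count_tokens)
        variants["S1-random"][question] = rng.choice(chains)
    return variants


# Hypothetical candidates for a single question.
demo = {"q1": ["try x then y then z so 7", "so 7", "try x so 7"]}
print(build_variants(demo)["S1-short"]["q1"])  # so 7
```

All three variants cover the same questions, so any difference after fine-tuning can be attributed to chain length rather than to which problems were seen.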

Key Hyperparameters:

temperature: 0.7
top_p: 0.95
max_tokens: 32768
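For orientation, the values above could be collected into a plain sampling configuration like the one below. The dictionary layout and the `n` entry are assumptions for illustration (k = 20 is borrowed from the Concrete Example, not from this list); the arithmetic shows the worst-case token budget when every chain runs to the limit.

```python
# Sampling settings from the list above; "n" (the number of parallel
# generations, k) is a hypothetical value added for illustration.
sampling = {
    "n": 20,
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 32768,
}

# Upper bound on generated tokens per question if no chain stops early.
worst_case_tokens = sampling["n"] * sampling["max_tokens"]
print(worst_case_tokens)  # 655360
```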

Compute: Inference run on vLLM with paged attention. Training compute not explicitly detailed.

Comparison to Prior Work

vs. Majority Voting: short-m@k terminates early (m < k), saving compute and time, and biases selection toward shorter chains, which are empirically more accurate.
vs. FFS: short-m@k generalizes FFS by introducing the early stopping parameter m and explicitly focusing on reasoning chain length as a termination criterion [FFS cited as related].
vs. Best-of-N: Does not require a separate reward model or verifier [not cited in paper].

Limitations

Effectiveness relies on the correlation between shortness and correctness, which may not hold for all task types (though shown for math/science).
Requires parallel serving infrastructure to realize wall-time gains (sequential generation wouldn't see speedup).
short-1@k performance can degrade at very large sample sizes (k) compared to majority voting for some models.

Reproducibility

No specific code URL provided in the paper text. Uses standard open models (Llama, Qwen, DeepSeek variants) and standard benchmarks (AIME, GPQA). Methods (short-m@k) are algorithmically simple to reproduce.

📊 Experiments & Results

Evaluation Setup

Complex reasoning on Math and Science problems

Benchmarks:

AIME 2024 (Math Competition)
AIME 2025 (Math Competition)
HMMT Feb 2025 (Math Competition)
GPQA-Diamond (Scientific Reasoning, Multiple Choice)

Metrics:

Accuracy
Thinking-compute (total tokens generated)
Time-to-answer (wall time)
Statistical methodology: Not explicitly reported in the paper
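The two efficiency metrics can be made concrete with a small calculation over per-chain thinking lengths. This is a simplified sketch, assuming all k chains decode in lock-step at one token per step, so that time-to-answer is measured in decoding steps rather than seconds; the function names and the lengths are hypothetical.

```python
from typing import Optional


def thinking_compute(lengths: list[int], m: Optional[int] = None) -> int:
    """Total thinking tokens generated across all k chains.

    m=None: every chain runs to completion (majority voting).
    Otherwise: all chains stop once the m-th shortest chain finishes.
    """
    if m is None:
        return sum(lengths)
    cutoff = sorted(lengths)[m - 1]
    return sum(min(n, cutoff) for n in lengths)


def time_to_answer(lengths: list[int], m: Optional[int] = None) -> int:
    """Decoding steps until the final answer can be selected."""
    return max(lengths) if m is None else sorted(lengths)[m - 1]


lengths = [3100, 3600, 4200, 9800, 12500]  # hypothetical, k = 5
print(thinking_compute(lengths), time_to_answer(lengths))            # 33200 12500
print(thinking_compute(lengths, m=3), time_to_answer(lengths, m=3))  # 19300 4200
print(thinking_compute(lengths, m=1), time_to_answer(lengths, m=1))  # 15500 3100
```

Note that early stopping still pays for the partial tokens of the unfinished chains, which is why the compute saving is smaller than the wall-time saving in this example.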

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparison of single generation selection strategies (Shortest vs. Random vs. Longest) showing strong bias for shorter answers.
Average (Math Benchmarks)	Accuracy	44.4	56.7	+12.3
Average (Math Benchmarks)	Accuracy	50.0	60.0	+10.0
Performance of short-3@k vs majority voting across compute budgets.
Average (Math Benchmarks)	Accuracy	62.0	64.0	+2.0
Average (Math Benchmarks)	Accuracy	78.0	82.0	+4.0
Average (Math Benchmarks)	Wall Time (relative to baseline)	1.0	0.67	-0.33

Experiment Figures

Accuracy vs Sample Size (k) for short-1@k, short-3@k, Majority Voting, and Pass@k Oracle.

Accuracy vs Thinking Compute (Total Tokens) for different methods.

Main Takeaways

Within a set of generations for the same question, the shortest reasoning chain is significantly more likely to be correct than random or long chains.
Longer chains often indicate the model is 'confused', engaging in excessive backtracking or loops.
Fine-tuning on shorter correct chains (S1-short) improves performance compared to training on long chains (S1-long), suggesting conciseness is a learnable virtue.
short-3@k offers a 'sweet spot', consistently beating majority voting in accuracy while being faster and cheaper.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Majority voting (Self-Consistency)
Autoregressive decoding
Test-time compute scaling

Key Terms

short-m@k: An inference method that runs k parallel generations but stops all processes as soon as m generations complete, voting among those m outputs.

thinking tokens: Intermediate tokens generated by a reasoning model (often enclosed in <think> tags) before the final answer.

pass@k: A metric measuring the probability that at least one correct answer exists within k generated samples.
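A common way to estimate pass@k from n samples, of which c are correct, is the unbiased combinatorial estimator 1 - C(n-c, k) / C(n, k). The sketch below shows that standard estimator; whether the paper computes its oracle curve exactly this way is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that a random size-k subset of n samples has a correct one.

    n: total samples drawn, c: number of correct samples, k: subset size.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a subset of size k
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=20, c=5, k=1))   # 0.25
print(pass_at_k(n=20, c=5, k=20))  # 1.0
```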

majority voting: An aggregation method that generates multiple samples and selects the most frequent final answer (also known as Self-Consistency).

test-time compute: The amount of computational resources (FLOPs, tokens) used during inference to generate a response, often scaled by generating more tokens or samples.

parallel decoding: Generating multiple sequences simultaneously using batching, as opposed to sequential generation.

SFT: Supervised Fine-Tuning, a training method in which a model learns from labeled examples of inputs and desired outputs.

RL: Reinforcement Learning, a training method in which a model learns from rewards and penalties.

backtracking: When a reasoning chain revisits previous steps or attempts to correct itself, often indicating difficulty or error.