Confidence Deficit: A phenomenon where a model doubts its own correct intermediate steps due to low internal confidence, triggering unnecessary reflection
Termination Delay: A phenomenon where a model continues generating text (reflections) even after reaching a correct and confident answer
SimPO: Simple Preference Optimization—a training method that aligns models with preferences (here, conciseness) without a reference model, often more memory-efficient than DPO
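To make the SimPO entry concrete, here is a minimal sketch of its loss: a length-normalized log-probability margin between the preferred (concise) and dispreferred response, with no reference model. The `beta` and `gamma` values and the toy log-probabilities below are illustrative, not taken from the paper.

```python
import math

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO objective for one preference pair:
    -log sigmoid( beta * (avg logp of winner - avg logp of loser) - gamma ).
    Lengths normalize the sequence log-probs, removing the length bias
    that would otherwise favor longer chains."""
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A concise chain with a higher average log-prob yields a small loss:
loss = simpo_loss(logp_w=-10.0, len_w=20, logp_l=-60.0, len_l=40)
print(round(loss, 4))
```

Because only average per-token log-probabilities of the policy model are needed, no frozen reference model has to be kept in memory, which is where the efficiency gain over DPO comes from.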
SFT: Supervised Fine-Tuning—training a model on high-quality target outputs (here, the concise chains generated by ConCISE)
Confidence Injection: The technique of inserting affirmative phrases (e.g., 'It is clear that...') into the context to artificially raise the model's confidence and prevent it from generating reflection steps
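A minimal sketch of how confidence injection could look inside a decoding loop: when the model proposes a reflection marker, the affirmative phrase is spliced into the context instead. The marker set, function name, and replacement policy here are hypothetical, chosen only to illustrate the idea.

```python
# Illustrative confidence injection: suppress an incipient reflection step
# by replacing the reflection marker with an affirmative phrase.
AFFIRMATIVE_PHRASE = "It is clear that"            # injected confidence cue
REFLECTION_MARKERS = {"Wait", "Hmm", "But wait"}   # assumed reflection triggers

def inject_confidence(context: str, proposed_next: str) -> str:
    """If the model's proposed continuation is a reflection marker,
    splice in the affirmative phrase; otherwise append it unchanged."""
    if proposed_next.strip() in REFLECTION_MARKERS:
        return context + " " + AFFIRMATIVE_PHRASE
    return context + " " + proposed_next

ctx = inject_confidence("Step 2: x = 4.", "Wait")  # reflection suppressed
print(ctx)
```

In a real system this check would run on the live model's proposed tokens; the point is that raising apparent confidence in the context steers the model away from generating a reflection.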
Early Stopping: Terminating the generation process when a detector indicates the model's internal confidence in the answer exceeds a threshold
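The early-stopping loop can be sketched as follows; the confidence detector is stubbed out with a fixed per-step schedule (fabricated numbers), where a real implementation would probe the model's internal state after each step.

```python
CONF_THRESHOLD = 0.9  # assumed stopping threshold

def generate_with_early_stop(steps, confidences, threshold=CONF_THRESHOLD):
    """Emit reasoning steps until the (stubbed) confidence detector
    reports confidence above `threshold`, then terminate generation."""
    emitted = []
    for step, conf in zip(steps, confidences):
        emitted.append(step)
        if conf > threshold:  # detector fires: answer is trusted, stop here
            break
    return emitted

steps = ["step1", "step2", "answer", "reflection1", "reflection2"]
confs = [0.3, 0.6, 0.95, 0.97, 0.98]
print(generate_with_early_stop(steps, confs))  # stops after "answer"
```

Cutting generation at the first confident answer is exactly what removes the termination-delay reflections defined above.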
Probing Prompt: A short text appended to the context to elicit a probability distribution over specific tokens (e.g., 'Wait', 'Great') that serves as a proxy for the model's internal state
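A sketch of the probing read-out: append the probe to the context, take the model's next-token logits, and use the relative probability of the affirmative token versus the hesitant one as a confidence proxy. The probe wording and the logits below are fabricated; a real implementation queries an LLM's next-token distribution.

```python
import math

def confidence_from_logits(logits):
    """Softmax over the two probe tokens; P('Great') serves as the
    confidence proxy for the model's internal state."""
    exp = {tok: math.exp(v) for tok, v in logits.items()}
    return exp["Great"] / sum(exp.values())

context = "... so x = 4."
probing_prompt = " Evaluate the step above:"  # hypothetical probe text
probed = context + probing_prompt             # fed to the model in practice
mock_logits = {"Great": 2.0, "Wait": 0.5}     # fabricated next-token logits
print(round(confidence_from_logits(mock_logits), 3))
```

This single scalar is what drives both confidence injection (inject when it is low) and early stopping (terminate when it is high).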