QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

📝 Paper Summary

Chain-of-Thought (CoT) Reasoning Model Distillation Efficient Inference

QFFT fine-tunes language models exclusively on reasoning responses without input questions, preserving efficient short reasoning patterns while enabling reflective long reasoning only when errors or uncertainty arise.

Core Problem

Distilling Long Chain-of-Thought (CoT) capabilities into smaller models via Supervised Fine-Tuning (SFT) causes 'overthinking,' where models indiscriminately apply lengthy, redundant reasoning even to simple problems.

Why it matters:

Standard SFT overrides a model's efficient 'Short CoT' capabilities, increasing inference latency and cost by generating unnecessary tokens
Current methods force a fixed mapping from Questions to Long Responses, losing the ability to adapt reasoning length to problem difficulty
Existing Long-to-Short compression methods often require substantial additional training or degrade performance on complex tasks

Concrete Example: When solving a simple math problem (e.g., calculating walking time), a standard SFT model distilled from DeepSeek-R1 generates 1,646 tokens of redundant checks. In contrast, the QFFT model solves it in 415 tokens using Short CoT, but correctly triggers lengthy verification (starting with 'Wait...') only when it encounters a non-integer root in a complex step.

Key Novelty

Question-Free Fine-Tuning (QFFT)

Removes the input question from the training data, fine-tuning the model using *only* the Long CoT response sequences
Prevents the model from learning a rigid Question-to-LongResponse mapping, thereby preserving its default Short CoT behavior for simple inputs
Implicitly teaches the model to trigger reflective Long CoT patterns (like self-correction) only when internal uncertainty arises during generation

Architecture

A case study illustrating the adaptive reasoning flow of a QFFT model.

Evaluation Highlights

Reduces average response length by ~50% across GSM8K, MATH, and AIME datasets compared to standard SFT while maintaining comparable accuracy
Achieves 78.6% accuracy on MATH with completely irrelevant/noisy input questions (Level IV noise), whereas standard SFT collapses to 0.4%
Outperforms SFT by +8.7 points on MMLU-Pro (Out-of-Domain) with Qwen2.5-32B-Instruct, demonstrating superior generalization

Breakthrough Assessment

8/10

Simple yet highly effective intervention (removing questions) that solves the prevalent 'overthinking' problem in reasoning distillation, offering massive efficiency gains with robust performance.

⚙️ Technical Details

Problem Definition

Setting: Distilling reasoning capabilities from a Large Reasoning Model (Teacher) to a smaller Student model

Inputs: Natural language question Q

Outputs: Reasoning chain and final answer R

Pipeline Flow

Input Question
Model generates Short CoT (Default)
Model detects uncertainty/error internally
Model triggers Reflective Long CoT (Adaptive)

System Modules

Base LLM

Generate reasoning and answers; decides implicitly when to switch reasoning modes

Model or implementation: Qwen2.5-Instruct (7B or 32B)

Novel Architectural Elements

No architectural changes to the model itself
Novel training data formatting: The input 'Question' field is empty during fine-tuning (Null-Question SFT)

Modeling

Base Model: Qwen2.5-Instruct (7B and 32B versions)

Training Method: Question-Free Fine-Tuning (SFT on responses only)

Objective Functions:

Purpose: Minimize negative log-likelihood of the reasoning response tokens, ignoring the question.

Formally: L_QFFT = - sum(log P(R_t | R_<t))

Adaptation: Full fine-tuning

Training Data:

Sources: S1.1 (1k samples), LIMO (871 samples), Bespoke-Stratos (17k samples)
Data format: Questions are removed; models trained only on the Long CoT responses distilled from DeepSeek-R1

Key Hyperparameters:

learning_rate: 1e-5 (7B) / 5e-6 (32B)
batch_size: 128 (7B) / 64 (32B)
num_epochs: 3
+ 2 more
max_length: 16384
LR_scheduler: cosine

Compute: Training performed on 8x H800 GPUs

Comparison to Prior Work

vs. SFT-Shortest/DPO: QFFT doesn't require sampling multiple responses or preference labeling; it learns adaptive behavior from single Long CoT examples by removing the query trigger
vs. O1-Pruner: QFFT achieves better accuracy retention (higher AES score) and is simpler to implement (no RL)
vs. DAD: QFFT doesn't require explicit difficulty labeling or dataset curation; adaptation is emergent
+ 1 more
vs. CoT-Valve [not cited in paper]: CoT-Valve dynamically adjusts length via a separate controller/prompting, while QFFT internalizes this switch via weight updates

Limitations

Token reduction is less significant on very difficult datasets (e.g., AIME) where Long CoT is almost always necessary
Relies on the assumption that the base model has strong inherent Short CoT capabilities to preserve
Does not explicitly optimize for conciseness; conciseness is a side effect of preserving the base model's behavior

Reproducibility

Code: https://github.com/LWL-cpu/Question-Free-Fine-Tuning

Code is publicly available. Datasets (S1.1, LIMO, Bespoke-Stratos) are public. Training hyperparameters are fully detailed in Appendix A.2.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning and out-of-domain QA

Benchmarks:

GSM8K (Grade school math)
MATH500 (Competition math)
AIME 2024 / 2025 (Challenging math competitions)
GPQA (Graduate-level QA (OOD))
MMLU-Pro (Multi-task language understanding (OOD))

Metrics:

Accuracy (Pass@1)
Average Token Count
Reasoning Adaptability Cohen’s Kappa (RAK)
Accuracy–Efficiency Score (AES)
Statistical methodology: Reported averages over 16 random sampling runs

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on 7B models show QFFT maintains accuracy while drastically cutting token usage compared to standard SFT.
MATH500 (7B)	Accuracy	80.8	80.2	-0.6
MATH500 (7B)	Tokens	5300	2800	-2500
GSM8K (7B)	Tokens	1700	400	-1300
MATH (Noise Level IV)	Accuracy	0.4	78.6	+78.2
MMLU-Pro (32B)	Accuracy	64.9	73.6	+8.7
MATH500 (7B)	RAK	3.5	47.7	+44.2

Experiment Figures

Reasoning Adaptability Cohen’s Kappa (RAK) scores for various models.

Main Takeaways

QFFT effectively mitigates overthinking: it uses Short CoT for simple problems (GSM8K) and Long CoT for hard ones (AIME), matching SFT accuracy with far fewer tokens.
The method is extremely robust to noisy training data (even 100% irrelevant questions) because it does not learn a dependence on the input question pattern.
QFFT outperforms SFT in low-resource and out-of-domain scenarios, suggesting it learns general reasoning structures rather than overfitting to specific question-answer mappings.
Analysis reveals QFFT models default to Short CoT and switch to Long CoT primarily upon encountering verification needs or errors (signaled by 'Wait...').

📚 Prerequisite Knowledge

Prerequisites

Supervised Fine-Tuning (SFT)
Chain-of-Thought (CoT) prompting
Knowledge Distillation

Key Terms

Long CoT: Reasoning patterns characterized by extensive self-reflection, verification, and error correction (e.g., DeepSeek-R1 traces)

Short CoT: Direct, concise reasoning without extensive self-reflection, typical of standard instruction-tuned models

SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs

RAK: Reasoning Adaptability Cohen’s Kappa—a metric proposed in this paper to measure how well a model's choice of reasoning pattern (Long vs. Short) aligns with problem difficulty

Overthinking: The tendency of models trained on Long CoT data to generate redundant reasoning steps for simple problems

OOD: Out-of-Domain—evaluating the model on datasets significantly different from its training distribution