LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!

📝 Paper Summary

Reasoning Distillation, Chain-of-Thought (CoT), Parameter-Efficient Fine-Tuning

LLMs can learn complex reasoning behaviors from very few demonstrations using parameter-efficient fine-tuning, driven primarily by the global structure of the reasoning chain rather than the correctness of intermediate steps.

Core Problem

Training Large Reasoning Models (LRMs) typically requires expensive reinforcement learning or massive datasets, and the specific mechanisms that enable models to learn 'Long CoT' reasoning are poorly understood.

Why it matters:

Existing high-performance reasoning models (like o1) are closed-source or prohibitively expensive to replicate
It is unclear whether models need to learn deep domain knowledge or simply acquire structured reasoning patterns to succeed
Understanding the minimal data requirements for reasoning allows for much cheaper and more accessible model training

Concrete Example: A model trained on 'correct' reasoning traces (where every step is valid) achieves high accuracy. Shuffling those same valid steps, which destroys the logical flow, lowers AIME 2024 accuracy by 13.3 percentage points. By contrast, training on traces with the *wrong* final answer but intact structure costs only 3.2 percentage points on average across benchmarks.

Key Novelty

Structural Reasoning Distillation

Demonstrates that the 'Long CoT' capability (reflection, backtracking) can be distilled into smaller models using only 17k samples via LoRA
Finds that the *structure* of the reasoning (logical coherence, use of 'wait'/'alternatively') matters more for learning than the factual correctness of the training content itself
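As a rough illustration of what structural markers look like in practice, the sketch below counts reflection keywords in a trace. The marker list and the function name are assumptions made for this example (the summary names only 'wait' and 'alternatively'); it is not the paper's measurement code.

```python
import re

# Hypothetical marker list; only these two keywords are named in the summary.
REFLECTION_MARKERS = ("wait", "alternatively")

def count_reflection_markers(trace: str) -> dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each marker."""
    counts = {}
    for marker in REFLECTION_MARKERS:
        pattern = rf"\b{re.escape(marker)}\b"
        counts[marker] = len(re.findall(pattern, trace, flags=re.IGNORECASE))
    return counts

trace = "Let x = 3. Wait, that gives 10, not 9. Alternatively, try x = 2. Wait, yes."
print(count_reflection_markers(trace))
```

A count like this only captures surface keywords; it says nothing about whether the steps around them are logically coherent.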

Architecture

Comparison of the distilled model (Sky-T1-32B-Preview) against the base Qwen2.5 model and OpenAI's o1-preview on Math-500, AIME, and AMC benchmarks.

Evaluation Highlights

+40.0 percentage points on AIME 2024 (16.7% → 56.7%) using Qwen2.5-32B-Instruct fine-tuned on just 17k samples
Achieves 57.0% on LiveCodeBench (+8.1 points vs. base), competitive with proprietary o1-preview (59.1%)
LoRA fine-tuning updates <5% of parameters yet matches full fine-tuning performance, suggesting that reasoning patterns are not knowledge-intensive to learn

Breakthrough Assessment

9/10

The counter-intuitive finding that models can learn strong reasoning from *incorrect* data (as long as structure is preserved) challenges common assumptions about data quality in instruction tuning and CoT distillation.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) of a Student LLM on reasoning traces generated by a Teacher LLM

Inputs: Complex Math or Coding Prompt q

Outputs: Long Chain-of-Thought response r (including reflection tags) followed by final answer a

Pipeline Flow

Data Curation (Teacher Model generation)
Filtering (Correctness check)
Student Training (LoRA/SFT on curated data)
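The three pipeline stages above can be sketched as follows. This is a minimal illustration under stated assumptions: `curate`, `toy_teacher`, and the exact-match filter are hypothetical stand-ins, and the paper's actual correctness check is not described in this summary.

```python
from typing import Callable

def curate(prompts, answers, teacher: Callable[[str], tuple[str, str]]):
    """Generate one trace per prompt and keep those whose final answer matches."""
    kept = []
    for prompt, reference in zip(prompts, answers):
        reasoning, final_answer = teacher(prompt)      # stage 1: teacher generation
        if final_answer.strip() == reference.strip():  # stage 2: correctness filter (assumed exact match)
            kept.append({"prompt": prompt, "response": f"{reasoning}\n{final_answer}"})
    return kept  # stage 3 would fine-tune the student on these records

def toy_teacher(prompt: str) -> tuple[str, str]:
    # Stand-in for a real teacher model; returns canned (reasoning, answer) pairs.
    canned = {
        "2+2?": ("2 plus 2 is 4. Wait, check: 4.", "4"),
        "3*3?": ("3 times 3 is 6.", "6"),
    }
    return canned[prompt]

data = curate(["2+2?", "3*3?"], ["4", "9"], toy_teacher)
print(len(data), data[0]["prompt"])
```

Only the first toy prompt survives the filter, because the canned answer to the second one is wrong.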

System Modules

Teacher Model

Generate Long CoT reasoning traces for hard prompts

Model or implementation: DeepSeek-R1 or QwQ-32B-Preview

Student Model

Learn to generate structured reasoning from demonstrations

Model or implementation: Qwen2.5-32B-Instruct

Modeling

Base Model: Qwen2.5-32B-Instruct

Training Method: Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA)

Adaptation: LoRA (rank not explicitly detailed in text, learning rate 1e-4) or Full SFT (learning rate 1e-5)

Trainable Parameters: <5% for LoRA
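To give a feel for why LoRA touches so few parameters, the sketch below computes the adapter size relative to one frozen weight matrix. The dimensions and rank are illustrative assumptions only; the summary does not report the rank used, and real fractions depend on which layers are adapted.

```python
def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters relative to the frozen d_out x d_in weight."""
    lora_params = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return lora_params / (d_in * d_out)

# Hypothetical 4096 x 4096 projection with an assumed rank of 64.
fraction = lora_fraction(4096, 4096, 64)
print(f"{fraction:.4%}")
```

Because the adapter grows linearly in the matrix dimensions while the frozen weight grows quadratically, the fraction shrinks as layers get wider.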

Training Data:

17k samples from DeepSeek-R1 (R1-17k dataset)
12k math and 5k coding samples from QwQ-32B-Preview
Hard prompts selected via difficulty classification (AoPS levels)

Key Hyperparameters:

batch_size: 96
learning_rate_sft: 1e-5
learning_rate_lora: 1e-4
warmup_ratio: 0.1
loss: Next token prediction
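The listed hyperparameters can be collected into a small config, from which the step counts follow by arithmetic. This is a sketch: the `TrainConfig` class and helper names are hypothetical, and the epoch count used below is an assumption since it is not listed above.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 96
    learning_rate_sft: float = 1e-5
    learning_rate_lora: float = 1e-4
    warmup_ratio: float = 0.1

def steps_per_epoch(num_samples: int, cfg: TrainConfig) -> int:
    """Optimizer steps per pass over the data, counting a final partial batch."""
    return math.ceil(num_samples / cfg.batch_size)

def warmup_steps(total_steps: int, cfg: TrainConfig) -> int:
    """Number of steps spent in learning-rate warmup."""
    return int(total_steps * cfg.warmup_ratio)

cfg = TrainConfig()
per_epoch = steps_per_epoch(17_000, cfg)  # 17k samples, as in the training data
assumed_epochs = 3                        # assumption: epochs are not listed above
total = per_epoch * assumed_epochs
print(per_epoch, total, warmup_steps(total, cfg))
```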

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-R1: Distills reasoning via supervised learning rather than bootstrapping via RL
vs. Standard SFT: Shows reasoning can be learned from incorrect content if structure is preserved, unlike standard SFT which emphasizes data quality/correctness
vs. Vicuna: Distills reasoning specifically, not just chat/style capabilities [not cited in paper]

Limitations

Reliance on a stronger teacher model (DeepSeek-R1/QwQ) for data generation
Performance drops significantly if the logical structure of training data is broken (shuffling)
Does not explore the limits of how 'incorrect' the content can be before structure learning fails (though 70% digit corruption was tested)
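The digit-corruption test mentioned in the last limitation can be pictured with a sketch like the one below. The function name and the per-digit replacement scheme are assumptions for illustration; this summary does not describe the paper's exact corruption procedure.

```python
import random

DIGITS = "0123456789"

def corrupt_digits(text: str, fraction: float, seed: int = 0) -> str:
    """Replace each ASCII digit with a random digit with probability `fraction`.

    A replacement may coincide with the original digit, so the share of
    digits that actually change is somewhat below `fraction`.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in DIGITS and rng.random() < fraction:
            out.append(rng.choice(DIGITS))
        else:
            out.append(ch)
    return "".join(out)

step = "So 12 * 7 = 84, and 84 + 16 = 100."
print(corrupt_digits(step, 0.7))
```

Note that the sentence structure and the connective words are left untouched; only the numeric content becomes unreliable.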

Reproducibility

Code: https://github.com/NovaSky-AI/SkyThought

The R1-17k dataset is public on Hugging Face, and the data curation methodology (filtering by difficulty) is described in detail.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on math and coding benchmarks, reported as Pass@1 (greedy decoding is inferred from the Pass@1 metric rather than stated explicitly)

Benchmarks:

AIME 2024 (Challenging Math Competition)
Math-500 (General Math Reasoning)
LiveCodeBench (Code Generation)
OlympiadBench (Olympiad-level Math/Physics)

Metrics:

Accuracy (Pass@1)
Statistical methodology: Not explicitly reported in the paper
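For reference, Pass@1 accuracy with one answer per problem reduces to a simple ratio. The sketch below uses exact string matching, which is an assumption; real math and code benchmarks use answer normalization or test execution. The 30-problem count assumes the combined AIME 2024 I and II set.

```python
def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """Percentage of problems whose single answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Toy data: 17 of 30 answers correct.
references = ["1"] * 30
predictions = ["1"] * 17 + ["0"] * 13
print(round(pass_at_1(predictions, references), 1))
```

On a 30-problem set each problem is worth about 3.3 points, so a reported 56.7% is consistent with 17 solved problems and 16.7% with 5.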

Key Results

Fine-tuning Qwen2.5-32B with just 17k R1 samples yields massive gains, matching or beating proprietary baselines.

Benchmark	Metric	Baseline	This Paper	Δ
AIME 2024	Accuracy	16.7	56.7	+40.0
LiveCodeBench	Accuracy	48.9	57.0	+8.1
Math-500	Accuracy	84.8	90.8	+6.0

Ablation studies reveal that model performance is robust to content errors but highly sensitive to structural corruption.

Benchmark	Metric	Baseline	This Paper	Δ
Average across benchmarks	Accuracy	66.3	63.1	-3.2
Average across benchmarks	Accuracy	66.3	62.0	-4.3
AIME 2024	Accuracy	53.3	40.0	-13.3
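The Δ column is the difference in percentage points between the two accuracy columns. A quick consistency check over the rows above, with the values copied from the table, might look like this:

```python
# (benchmark, baseline, this paper, reported delta), copied from the table above.
rows = [
    ("AIME 2024", 16.7, 56.7, 40.0),
    ("LiveCodeBench", 48.9, 57.0, 8.1),
    ("Math-500", 84.8, 90.8, 6.0),
    ("Average (ablation 1)", 66.3, 63.1, -3.2),
    ("Average (ablation 2)", 66.3, 62.0, -4.3),
    ("AIME 2024 (ablation)", 53.3, 40.0, -13.3),
]

# Round to one decimal to absorb floating-point noise before comparing.
mismatches = [
    name for name, baseline, result, reported in rows
    if round(result - baseline, 1) != reported
]
print(mismatches)
```

The row labels in parentheses are added here only to keep the tuples distinguishable; they are not names used by the paper.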

Experiment Figures

Impact of dataset size on model performance across benchmarks.

Visual explanation of the structural perturbation experiments: Deletion, Insertion, and Shuffle of reasoning steps.
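The three perturbations named in this figure can be sketched on a trace that has already been split into steps. The function names, the `fraction` parameter, and the sampling scheme are assumptions for illustration, not the paper's implementation.

```python
import random

def shuffle_steps(steps, fraction, seed=0):
    """Permute a random subset of positions, leaving the other steps in place."""
    rng = random.Random(seed)
    k = round(len(steps) * fraction)
    positions = sorted(rng.sample(range(len(steps)), k))
    moved = [steps[i] for i in positions]
    rng.shuffle(moved)
    out = list(steps)
    for pos, step in zip(positions, moved):
        out[pos] = step
    return out

def delete_steps(steps, fraction, seed=0):
    """Drop a random subset of steps, keeping the order of the rest."""
    rng = random.Random(seed)
    k = round(len(steps) * fraction)
    drop = set(rng.sample(range(len(steps)), k))
    return [s for i, s in enumerate(steps) if i not in drop]

def insert_steps(steps, foreign_steps, fraction, seed=0):
    """Insert steps drawn from other traces at random positions."""
    rng = random.Random(seed)
    k = round(len(steps) * fraction)
    out = list(steps)
    for _ in range(k):
        out.insert(rng.randrange(len(out) + 1), rng.choice(foreign_steps))
    return out

steps = ["Set up equation.", "Solve for x.", "Wait, recheck sign.", "Answer: 4."]
foreign = ["Consider the triangle ABC."]
print(shuffle_steps(steps, 0.5))
print(delete_steps(steps, 0.5))
print(insert_steps(steps, foreign, 0.5))
```

Each function keeps the individual steps intact, so any accuracy change after training on the perturbed traces is attributable to the ordering and continuity of steps rather than to their local content.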

Main Takeaways

Structure over Content: The model's ability to reason is driven by the logical structure of the CoT (reflection, backtracking) rather than the factual correctness of the numbers or answers in the training data.
Data Efficiency: About 17k samples are enough to approach saturated performance; returns diminish once the training set grows beyond roughly 16k samples.
Parameter Efficiency: LoRA works just as well as full fine-tuning for reasoning distillation, suggesting reasoning is a 'style' or 'pattern' rather than new knowledge.
Logical Consistency is Key: Shuffling or inserting disjoint reasoning steps destroys performance, confirming the model learns global coherence, not just local step imitation.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) fine-tuning (SFT)
Familiarity with Chain-of-Thought (CoT) prompting
Basic knowledge of LoRA (Low-Rank Adaptation)

Key Terms

Long CoT: Extended reasoning traces that include explicit steps for reflection, backtracking, and self-validation (e.g., 'Wait, let me check that')

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning method that freezes pre-trained weights and injects trainable rank-decomposition matrices

SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs

LRM: Large Reasoning Model—models specifically optimized for complex multi-step reasoning tasks (e.g., OpenAI o1, DeepSeek-R1)

DeepSeek-R1: A strong open-source reasoning model used as a 'teacher' to generate training data in this paper

QwQ: Qwen-based reasoning model used as a teacher for distillation

AIME: American Invitational Mathematics Examination—a challenging high-school math competition benchmark
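The LoRA entry above can be made concrete with a tiny forward pass in plain Python. This is a sketch with toy matrices: the helper names and values are illustrative, and `scale` stands in for the usual alpha/rank factor.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, scale=1.0):
    """h = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen 2x2 pre-trained weight
A = [[1.0, 0.0]]              # rank-1 down-projection (1x2)
B_zero = [[0.0], [0.0]]       # B is initialized to zero, so the adapter starts as a no-op
B_trained = [[1.0], [2.0]]    # toy values standing in for a trained B
x = [1.0, 1.0]

print(lora_forward(W, A, B_zero, x))
print(lora_forward(W, A, B_trained, x))
```

With `B_zero` the output equals the frozen model's output, which is why fine-tuning starts from the pre-trained behavior and only gradually departs from it.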