
Climbing the Ladder of Reasoning: What LLMs Can - and Still Can't - Solve after SFT?

Yiyou Sun, Georgia Zhou, Haoyue Bai, Hao Wang, Dacheng Li, Nouha Dziri, Dawn Song
University of California, Berkeley; University of Wisconsin–Madison; Allen Institute for AI
arXiv (2025)
Reasoning · Benchmark · RL

📝 Paper Summary

Mathematical Reasoning · Supervised Fine-Tuning (SFT) · Model Capabilities Analysis
By categorizing AIME24 problems into difficulty tiers, the authors find that small-scale SFT solves Medium problems via style transfer but hits a hard ceiling on Extremely Hard problems requiring novel geometric or combinatorial intuition.
Core Problem
While small-scale SFT improves math reasoning, it is unclear whether these gains reflect genuine generalization or overfitting, and which specific types of problems remain unsolvable regardless of data scaling.
Why it matters:
  • Recent claims suggest small datasets (~1K samples) are sufficient for reasoning, but the limits of this efficiency are unknown.
  • Understanding which improvements stem from style adoption versus actual reasoning capability is critical for advancing beyond current plateaus.
  • Identifying specific failure modes (e.g., computational instability vs. lack of intuition) guides the design of future training curricula.
Concrete Example: In AIME 2024 Problem #2, a model must find the probability of a specific colored octagon configuration. While SFT models can solve standard counting problems (Medium tier), they fail this 'Extremely Hard' problem (0% accuracy) because they rigidly apply the inclusion-exclusion principle—a learned pattern—instead of the simpler, necessary casework approach.
Key Novelty
The Reasoning Ladder Analysis
  • Categorizes math problems into four tiers (Easy, Medium, Hard, Extremely Hard) based on empirical model performance rather than human estimation.
  • Demonstrates that 'Medium' proficiency is largely a result of adopting the 'R1-style' reasoning format (long CoT with reflection), requiring as few as 500 samples.
  • Identifies that 'Extremely Hard' problems require out-of-distribution strategies that cannot be learned through standard SFT scaling, unlike 'Hard' problems which scale logarithmically.
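The tiering idea above can be sketched in a few lines: each problem is assigned a tier from the model's empirical pass rate over repeated samples, not from human difficulty estimates. The threshold values and problem IDs below are hypothetical illustrations, not the paper's exact cutoffs.

```python
# Sketch of performance-based difficulty tiering: a problem's tier is
# determined by how often a model solves it, not by human judgment.
# The thresholds here are assumed for illustration.

def assign_tier(pass_rate: float) -> str:
    """Map a model's empirical pass rate on a problem to a difficulty tier."""
    if pass_rate >= 0.9:
        return "Easy"
    if pass_rate >= 0.5:
        return "Medium"
    if pass_rate > 0.0:
        return "Hard"
    return "Extremely Hard"  # never solved, regardless of sampling

# Hypothetical pass rates over repeated sampling for four AIME-style problems
pass_rates = {"P1": 0.95, "P2": 0.60, "P3": 0.20, "P4": 0.00}
tiers = {pid: assign_tier(rate) for pid, rate in pass_rates.items()}
```

A 0% pass rate under any amount of sampling is what defines the Extremely Hard tier, which is why it functions as a hard ceiling rather than a point on a scaling curve.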
Evaluation Highlights
  • Fine-tuning on just 1K random R1-style trajectories improves Qwen2.5-32B's accuracy on Medium-level AIME24 questions from ~10% to ~90%.
  • On Hard-level questions, accuracy improves only logarithmically with dataset size and plateaus at ~65%, even at 20K samples.
  • Current SFT models achieve 0% accuracy on Extremely Hard (Exh) level questions regardless of dataset size or curation.
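The logarithmic scaling reported for Hard-tier questions can be made concrete with a least-squares fit of accuracy against log dataset size. The data points below are hypothetical stand-ins, not numbers from the paper; the sketch only shows the shape of the claimed relationship.

```python
import math

# Fit accuracy = a + b * ln(N) to illustrate logarithmic scaling with
# dataset size N. Data points are illustrative, not from the paper.
sizes = [500, 1000, 5000, 20000]   # SFT dataset sizes (trajectories)
accs = [0.45, 0.52, 0.60, 0.65]    # Hard-tier accuracy (hypothetical)

# Closed-form simple linear regression on the log-transformed sizes
xs = [math.log(n) for n in sizes]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(accs) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accs)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Under this model, each doubling of dataset size adds only
# b * ln(2) accuracy points, so gains flatten quickly.
gain_per_doubling = b * math.log(2)
```

The takeaway matches the paper's framing: a positive but small slope on a log axis means even 20K samples buys little over 5K, and no amount of the same data moves the 0% Extremely Hard tier.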
Breakthrough Assessment
7/10
Provides a crucial, granular analysis of *why* SFT works (style transfer vs. reasoning) and defines the current ceiling (Exh problems). It refutes the 'small data is all you need' hype for harder problems.