
Climbing the Ladder of Reasoning: What LLMs Can - and Still Can't - Solve after SFT?

Yiyou Sun, Georgia Zhou, Haoyue Bai, Hao Wang, Dacheng Li, Nouha Dziri, Dawn Song
University of California, Berkeley; University of Wisconsin–Madison; Allen Institute for AI
arXiv (2025)
Reasoning · Benchmark · RL

📝 Paper Summary

Mathematical Reasoning · Supervised Fine-Tuning (SFT) · Model Capabilities Analysis
By categorizing AIME24 problems into difficulty tiers, the authors find that small-scale SFT solves Medium problems via style transfer but hits a hard ceiling on Extremely Hard problems requiring novel geometric or combinatorial intuition.
Core Problem
While small-scale SFT improves math reasoning, it is unclear whether these gains reflect genuine generalization or overfitting, and which specific types of problems remain unsolvable regardless of data scaling.
Why it matters:
  • Recent claims suggest small datasets (~1K samples) are sufficient for reasoning, but the limits of this efficiency are unknown.
  • Understanding which improvements stem from style adoption versus actual reasoning capability is critical for advancing beyond current plateaus.
  • Identifying specific failure modes (e.g., computational instability vs. lack of intuition) guides the design of future training curricula.
Concrete Example: In AIME 2024 Problem #2, a model must find the probability of a specific colored octagon configuration. While SFT models can solve standard counting problems (Medium tier), they fail this 'Extremely Hard' problem (0% accuracy) because they rigidly apply the inclusion-exclusion principle—a learned pattern—instead of the simpler, necessary casework approach.
Key Novelty
The Reasoning Ladder Analysis
  • Categorizes math problems into four tiers (Easy, Medium, Hard, Extremely Hard) based on empirical model performance rather than human estimation.
  • Demonstrates that 'Medium' proficiency is largely a result of adopting the 'R1-style' reasoning format (long CoT with reflection), requiring as few as 500 samples.
  • Identifies that 'Extremely Hard' problems require out-of-distribution strategies that cannot be learned through standard SFT scaling, unlike 'Hard' problems which scale logarithmically.
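The tiering idea above can be sketched in a few lines: each problem is assigned a tier from the model's empirical pass rate over repeated samples, not from human difficulty estimates. The threshold values and problem IDs below are hypothetical illustrations, not the paper's exact cutoffs.

```python
# Sketch of performance-based difficulty tiering: a problem's tier is
# determined by how often a model solves it, not by human judgment.
# The thresholds here are assumed for illustration.

def assign_tier(pass_rate: float) -> str:
    """Map a model's empirical pass rate on a problem to a difficulty tier."""
    if pass_rate >= 0.9:
        return "Easy"
    if pass_rate >= 0.5:
        return "Medium"
    if pass_rate > 0.0:
        return "Hard"
    return "Extremely Hard"  # never solved, regardless of sampling

# Hypothetical pass rates over repeated sampling for four AIME-style problems
pass_rates = {"P1": 0.95, "P2": 0.60, "P3": 0.20, "P4": 0.00}
tiers = {pid: assign_tier(rate) for pid, rate in pass_rates.items()}
```

A 0% pass rate under any amount of sampling is what defines the Extremely Hard tier, which is why it functions as a hard ceiling rather than a point on a scaling curve.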
Evaluation Highlights
  • Fine-tuning on just 1K random R1-style trajectories improves Qwen2.5-32B's accuracy on Medium-level AIME24 questions from ~10% to ~90%.
  • On Hard-level questions, accuracy improves only logarithmically with dataset size and plateaus at ~65%, even at 20K samples.
  • Current SFT models achieve 0% accuracy on Extremely Hard (Exh) level questions regardless of dataset size or curation.
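The logarithmic scaling reported for Hard-tier questions can be made concrete with a least-squares fit of accuracy against log dataset size. The data points below are hypothetical stand-ins, not numbers from the paper; the sketch only shows the shape of the claimed relationship.

```python
import math

# Fit accuracy = a + b * ln(N) to illustrate logarithmic scaling with
# dataset size N. Data points are illustrative, not from the paper.
sizes = [500, 1000, 5000, 20000]   # SFT dataset sizes (trajectories)
accs = [0.45, 0.52, 0.60, 0.65]    # Hard-tier accuracy (hypothetical)

# Closed-form simple linear regression on the log-transformed sizes
xs = [math.log(n) for n in sizes]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(accs) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accs)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Under this model, each doubling of dataset size adds only
# b * ln(2) accuracy points, so gains flatten quickly.
gain_per_doubling = b * math.log(2)
```

The takeaway matches the paper's framing: a positive but small slope on a log axis means even 20K samples buys little over 5K, and no amount of the same data moves the 0% Extremely Hard tier.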
Breakthrough Assessment
7/10
Provides a crucial, granular analysis of *why* SFT works (style transfer vs. reasoning) and defines the current ceiling (Exh problems). It refutes the 'small data is all you need' hype for harder problems.