
Surgical Post-Training: Cutting Errors, Keeping Knowledge

Wenye Lin, Kai Han
arXiv (2026)
Reasoning RL

📝 Paper Summary

LLM Post-training · Mathematical Reasoning
SPoT improves reasoning by training on surgically corrected model errors using a binary classification objective that implicitly regularizes the model to prevent catastrophic forgetting.
Core Problem
Fine-tuning on new data often causes catastrophic forgetting of prior knowledge, while reinforcement learning is computationally expensive and limited by the model's ability to sample correct answers.
Why it matters:
  • Supervised Fine-Tuning (SFT) destroys pre-trained capabilities due to distribution shift, making models worse at general tasks while learning specific ones
  • On-policy Reinforcement Learning (RL) is inefficient for hard tasks where the model rarely samples the correct solution naturally
  • Standard SFT on positive data inadvertently increases the probability of incorrect answers similar to the target (the 'pull-up' effect)
Concrete Example: When a model generates a mostly correct math solution with one logical flaw, standard SFT pushes the probability of the corrected sequence toward 1. This unbounded optimization overwrites pre-trained features. Meanwhile, because the corrected sequence and the original flawed sequence share most of their tokens, maximizing the likelihood of the correct sequence also inadvertently raises the probability of the flawed one, preventing the model from learning a sharp distinction between them.
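The 'pull-up' effect above can be illustrated with a toy model. This is a minimal sketch, not the paper's setup: a hypothetical three-position vocabulary-of-four softmax model where the corrected sequence ("A B C") and the model's original flawed sequence ("A B D") share two of three tokens. A single SFT cross-entropy step on the corrected sequence also raises the flawed sequence's probability, because the gains on the shared tokens outweigh the loss on the final token.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def seq_prob(logits, seq):
    # Probability of a token sequence under independent per-position softmaxes.
    return float(np.prod([softmax(logits[pos])[tok] for pos, tok in enumerate(seq)]))

# Toy vocabulary {0: A, 1: B, 2: C, 3: D}; three positions, uniform initial logits.
logits = np.zeros((3, 4))
correct = [0, 1, 2]   # "A B C" -- the surgically corrected solution
flawed  = [0, 1, 3]   # "A B D" -- the original error; shares 2 of 3 tokens

p_flawed_before = seq_prob(logits, flawed)

# One SFT gradient step on -log p(correct): the cross-entropy gradient w.r.t.
# the logits is (softmax - onehot), so the target token's logit rises and the
# rest fall at every position of the corrected sequence.
lr = 1.0
for pos, tok in enumerate(correct):
    grad = softmax(logits[pos])
    grad[tok] -= 1.0
    logits[pos] -= lr * grad

p_flawed_after = seq_prob(logits, flawed)
print(p_flawed_before, p_flawed_after)  # the flawed sequence's probability rises too
```

With a uniform start, the flawed sequence's probability more than doubles after the step, even though its final token was pushed down: the shared-prefix tokens pull it up.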
Key Novelty
Surgical Post-Training (SPoT)
  • Uses an Oracle to minimally edit ('surgically correct') the model's own errors, creating training data that is topologically close to the model's existing distribution
  • Identifies that reward-based objectives act as an 'Elastic Tether': as the model learns a sample, the gradient scaling coefficient vanishes (due to sigmoid saturation), automatically stopping updates to preserve prior knowledge
  • Replaces DPO's relative ranking with a binary classification objective (BCE) to provide denser supervision signals suitable for rigid reasoning tasks
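The 'Elastic Tether' mechanism can be sketched numerically. This is a generic sigmoid-based objective, not the paper's exact loss (the true margin definition and any β scaling are assumptions here): for a positive sample trained with -log σ(β·m), where m is a log-probability margin, the gradient magnitude with respect to m is β·σ(-β·m). It is large while the sample is misclassified and vanishes once the sample is learned, so updates stop automatically instead of pushing probabilities toward 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_grad_coeff(margin, beta=1.0):
    # For a positive sample under loss -log(sigmoid(beta * margin)),
    # |d loss / d margin| = beta * sigmoid(-beta * margin):
    # large when the sample is misclassified, vanishing once it is learned
    # (sigmoid saturation) -- the "Elastic Tether" going slack.
    return beta * sigmoid(-beta * margin)

for m in [-2.0, 0.0, 2.0, 5.0, 10.0]:
    print(f"margin={m:5.1f}  grad scale={bce_grad_coeff(m):.5f}")
```

By contrast, the SFT cross-entropy gradient only vanishes when the target probability actually reaches 1, which is the unbounded optimization described in the example above; the saturating coefficient is what implicitly regularizes the model and preserves prior knowledge.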
Evaluation Highlights
  • Improves Qwen3-8B's accuracy by 6.2% on average across in-domain and out-of-domain tasks compared to baselines
  • Achieves these gains with only 4k rectified math data pairs
  • Extremely efficient training: only 28 minutes on 8× H800 GPUs
Breakthrough Assessment
9/10
Identifies a fundamental theoretical mechanism ('Elastic Tether') explaining why DPO prevents forgetting where SFT fails, and proposes a highly efficient method (28 mins) that leverages this insight.