
Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives

Zecheng Wang, Deyuan Liu, Chunshan Li, Yupeng Zhang, Zhengyun Zhao, Dianhui Chu, Bingning Wang, Dianbo Sui
Harbin Institute of Technology, WeChat, Tencent, Tsinghua University
arXiv (2026)
Tags: Reasoning · Factuality · RL

📝 Paper Summary

Topics: Supervised Fine-Tuning (SFT) · Loss Function Design · Data Quality and Noise Robustness
DEFT introduces a dynamic token-level loss function that transitions from standard NLL (for uncertain tokens) to a linear probability loss (for confident tokens), balancing the learning of new knowledge against the sharpening of knowledge the model already has.
Core Problem
Standard NLL loss weights every token identically, producing two failure modes: it overfits to noise and label conflicts when the model is uncertain (large gradients on bad data), and it learns inefficiently once the model is confident (gradients decay too quickly to sharpen the distribution further).
Why it matters:
  • SFT data often contains 'confident conflicts' where the pretrained model's correct prior clashes with noisy targets, leading to catastrophic forgetting
  • Uniform weighting fails to sharpen the model's distribution on high-quality data once the model becomes confident, leading to diminishing returns
  • Existing methods that reweight based on confidence either suppress learning of hard positives (too much filtering) or fail to sharpen effectively
Concrete Example: In a dataset with noisy labels, a pretrained model might be confident and correct about a token, but the target is wrong. NLL assigns maximum gradient to this 'error', forcing the model to unlearn its correct prior to fit the noise. Conversely, for easy tokens where the model is already 90% confident, NLL gradients vanish too quickly, preventing the model from pushing confidence to 99%.
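This gradient asymmetry can be made concrete. The sketch below (an illustration, not the paper's code) compares the gradient each loss sends to the target-token logit under a softmax model: for NLL the magnitude is 1 - p_t, largest exactly when the model disagrees with the target, while for a linear probability loss it is p_t(1 - p_t), gated by the model's own confidence.

```python
# Illustration only: per-token gradient magnitudes on the target logit z_t
# under a softmax model, derived by the chain rule.
#
# NLL:    L = -log p_t  ->  |dL/dz_t| = 1 - p_t        (peaks on disagreement)
# Linear: L = 1 - p_t   ->  |dL/dz_t| = p_t * (1 - p_t) (gated by confidence)

def nll_grad_magnitude(p_t: float) -> float:
    """Gradient magnitude on the target logit for the NLL loss -log p_t."""
    return 1.0 - p_t

def linear_grad_magnitude(p_t: float) -> float:
    """Gradient magnitude on the target logit for the linear loss 1 - p_t."""
    return p_t * (1.0 - p_t)

# A confidently-wrong token (p_t = 0.05): NLL pushes hard, linear barely moves.
# A confidently-right token (p_t = 0.90): both gradients are small, but the
# linear loss keeps a comparatively larger share of its push.
for p in (0.05, 0.50, 0.90):
    print(f"p_t={p:.2f}  NLL grad={nll_grad_magnitude(p):.4f}  "
          f"linear grad={linear_grad_magnitude(p):.4f}")
```

The noisy-label case in the example above corresponds to the first row: NLL treats the model's correct prior as a large error and overwrites it, while the confidence-gated loss largely ignores the conflicting target.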
Key Novelty
Dynamic Entropy Fine-Tuning (DEFT)
  • Unifies SFT losses into a 'deformed-log' family where a single parameter (alpha) controls the 'trust gate'—how much the gradient scales with model confidence
  • Uses the Cayley transform to derive a theoretically optimal trajectory for alpha, ensuring it starts at 0 (NLL-like coverage) when uncertain and moves to 1 (linear sharpening) when confident
  • Implements this trajectory via a parameter-free objective that modulates gradients based on the model's predictive entropy (Rényi-2 entropy)
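The mechanism can be sketched in a few lines. The deformed-log family below, L_α(p) = (1 - p^α)/α, is a standard (Tsallis-style) parameterization that recovers -log p as α → 0 and the linear loss 1 - p at α = 1; the mapping from Rényi-2 entropy to α shown here (the collision probability Σ p_i², i.e. exp(-H₂)) is an assumed schedule for illustration, since the summary does not give the paper's exact Cayley-transform trajectory.

```python
# Hypothetical sketch of a DEFT-style per-token objective.
# The alpha schedule is an assumption; the paper derives its optimal
# trajectory via the Cayley transform, which this mapping only approximates.
import math

def deformed_log_loss(p_t: float, alpha: float) -> float:
    """Deformed-log family: alpha=0 recovers -log p_t; alpha=1 gives 1 - p_t."""
    if alpha < 1e-8:
        return -math.log(p_t)
    return (1.0 - p_t ** alpha) / alpha

def renyi2_confidence(probs):
    """Collision probability sum(p_i^2) = exp(-H_2); near 1 when confident,
    near 1/V when the distribution over V tokens is flat."""
    return sum(p * p for p in probs)

def deft_token_loss(probs, target_idx):
    """Gate alpha by the model's own predictive entropy:
    uncertain -> NLL-like coverage, confident -> linear sharpening."""
    alpha = renyi2_confidence(probs)  # assumed schedule, in (0, 1]
    return deformed_log_loss(probs[target_idx], alpha)

# Uncertain (flat) distribution: alpha is small, loss behaves like NLL.
print(deft_token_loss([0.25, 0.25, 0.25, 0.25], 0))
# Confident distribution: alpha is near 1, loss is near-linear.
print(deft_token_loss([0.90, 0.05, 0.03, 0.02], 0))
```

Because α is computed from the model's current predictions, the "trust gate" adapts token by token with no tunable hyperparameter, matching the parameter-free property claimed above.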
Evaluation Highlights
  • Consistent gains across 7 model backbones and multiple domains, with DEFT outperforming NLL in nearly all cases
  • Significant improvements in 'Model-Strong' regimes (high prior knowledge) by preventing noise overfitting, while maintaining performance in 'Model-Weak' regimes (new knowledge)
  • Reduces catastrophic forgetting of pretrained priors compared to standard NLL, as evidenced by lower forgetting rates in token-level analysis
Breakthrough Assessment
7/10
Strong theoretical grounding connecting loss functions to generalized entropy. Offers a parameter-free, drop-in replacement for cross-entropy that addresses a fundamental SFT tension (the plasticity-stability trade-off). Empirical gains are consistent.