
Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives

Zecheng Wang, Deyuan Liu, Chunshan Li, Yupeng Zhang, Zhengyun Zhao, Dianhui Chu, Bingning Wang, Dianbo Sui
Harbin Institute of Technology, WeChat, Tencent, Tsinghua University
arXiv (2026)
Tags: Reasoning · Factuality · RL

📝 Paper Summary

Topics: Supervised Fine-Tuning (SFT) · Loss Function Design · Data Quality and Noise Robustness
DEFT introduces a dynamic token-level loss function that transitions from standard NLL (for uncertain tokens) to a linear probability loss (for confident tokens), balancing the learning of new knowledge against the sharpening of knowledge the model already has.
Core Problem
Standard NLL loss weights every token identically, producing two failure modes: it overfits to noise and label conflicts when the model is uncertain (large gradients on bad data), and it learns inefficiently once the model is confident (gradients decay too quickly to sharpen the distribution further).
Why it matters:
  • SFT data often contains 'confident conflicts' where the pretrained model's correct prior clashes with noisy targets, leading to catastrophic forgetting
  • Uniform weighting fails to sharpen the model's distribution on high-quality data once the model becomes confident, leading to diminishing returns
  • Existing methods that reweight based on confidence either suppress learning of hard positives (too much filtering) or fail to sharpen effectively
Concrete Example: In a dataset with noisy labels, a pretrained model might be confident and correct about a token, but the target is wrong. NLL assigns maximum gradient to this 'error', forcing the model to unlearn its correct prior to fit the noise. Conversely, for easy tokens where the model is already 90% confident, NLL gradients vanish too quickly, preventing the model from pushing confidence to 99%.
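This gradient asymmetry can be made concrete. The sketch below (an illustration, not the paper's code) compares the gradient each loss sends to the target-token logit under a softmax model: for NLL the magnitude is 1 - p_t, largest exactly when the model disagrees with the target, while for a linear probability loss it is p_t(1 - p_t), gated by the model's own confidence.

```python
# Illustration only: per-token gradient magnitudes on the target logit z_t
# under a softmax model, derived by the chain rule.
#
# NLL:    L = -log p_t  ->  |dL/dz_t| = 1 - p_t        (peaks on disagreement)
# Linear: L = 1 - p_t   ->  |dL/dz_t| = p_t * (1 - p_t) (gated by confidence)

def nll_grad_magnitude(p_t: float) -> float:
    """Gradient magnitude on the target logit for the NLL loss -log p_t."""
    return 1.0 - p_t

def linear_grad_magnitude(p_t: float) -> float:
    """Gradient magnitude on the target logit for the linear loss 1 - p_t."""
    return p_t * (1.0 - p_t)

# A confidently-wrong token (p_t = 0.05): NLL pushes hard, linear barely moves.
# A confidently-right token (p_t = 0.90): both gradients are small, but the
# linear loss keeps a comparatively larger share of its push.
for p in (0.05, 0.50, 0.90):
    print(f"p_t={p:.2f}  NLL grad={nll_grad_magnitude(p):.4f}  "
          f"linear grad={linear_grad_magnitude(p):.4f}")
```

The noisy-label case in the example above corresponds to the first row: NLL treats the model's correct prior as a large error and overwrites it, while the confidence-gated loss largely ignores the conflicting target.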
Key Novelty
Dynamic Entropy Fine-Tuning (DEFT)
  • Unifies SFT losses into a 'deformed-log' family where a single parameter (alpha) controls the 'trust gate'—how much the gradient scales with model confidence
  • Uses the Cayley transform to derive a theoretically optimal trajectory for alpha, ensuring it starts at 0 (NLL-like coverage) when uncertain and moves to 1 (linear sharpening) when confident
  • Implements this trajectory via a parameter-free objective that modulates gradients based on the model's predictive entropy (Rényi-2 entropy)
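The mechanism can be sketched in a few lines. The deformed-log family below, L_α(p) = (1 - p^α)/α, is a standard (Tsallis-style) parameterization that recovers -log p as α → 0 and the linear loss 1 - p at α = 1; the mapping from Rényi-2 entropy to α shown here (the collision probability Σ p_i², i.e. exp(-H₂)) is an assumed schedule for illustration, since the summary does not give the paper's exact Cayley-transform trajectory.

```python
# Hypothetical sketch of a DEFT-style per-token objective.
# The alpha schedule is an assumption; the paper derives its optimal
# trajectory via the Cayley transform, which this mapping only approximates.
import math

def deformed_log_loss(p_t: float, alpha: float) -> float:
    """Deformed-log family: alpha=0 recovers -log p_t; alpha=1 gives 1 - p_t."""
    if alpha < 1e-8:
        return -math.log(p_t)
    return (1.0 - p_t ** alpha) / alpha

def renyi2_confidence(probs):
    """Collision probability sum(p_i^2) = exp(-H_2); near 1 when confident,
    near 1/V when the distribution over V tokens is flat."""
    return sum(p * p for p in probs)

def deft_token_loss(probs, target_idx):
    """Gate alpha by the model's own predictive entropy:
    uncertain -> NLL-like coverage, confident -> linear sharpening."""
    alpha = renyi2_confidence(probs)  # assumed schedule, in (0, 1]
    return deformed_log_loss(probs[target_idx], alpha)

# Uncertain (flat) distribution: alpha is small, loss behaves like NLL.
print(deft_token_loss([0.25, 0.25, 0.25, 0.25], 0))
# Confident distribution: alpha is near 1, loss is near-linear.
print(deft_token_loss([0.90, 0.05, 0.03, 0.02], 0))
```

Because α is computed from the model's current predictions, the "trust gate" adapts token by token with no tunable hyperparameter, matching the parameter-free property claimed above.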
Evaluation Highlights
  • Consistent gains across 7 model backbones and multiple domains, with DEFT outperforming NLL in nearly all cases
  • Significant improvements in 'Model-Strong' regimes (high prior knowledge) by preventing noise overfitting, while maintaining performance in 'Model-Weak' regimes (new knowledge)
  • Reduces catastrophic forgetting of pretrained priors compared to standard NLL, as evidenced by lower forgetting rates in token-level analysis
Breakthrough Assessment
7/10
Strong theoretical grounding connecting loss functions to generalized entropy. Offers a parameter-free, drop-in replacement for cross-entropy that addresses a fundamental SFT tension (the plasticity-stability trade-off). Empirical gains are consistent.